The ReWiND framework, which consists of three stages: learning a reward function, policy pre-training, and using the reward function and pre-trained policy to learn a new language-specified task online.
In their paper ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations, which was presented at CoRL 2025, Jiahui Zhang, Yusen Luo, Abrar Anwar, Sumedh A. Sontakke, Joseph J. Lim, Jesse Thomason, Erdem Bıyık and Jesse Zhang present a framework for learning robot control tasks solely from language instructions, without per-task demonstrations. We asked Jiahui Zhang and Jesse Zhang to tell us more.
What is the topic of the research in your paper, and what problem were you aiming to address?
Our research addresses the problem of enabling robot control policies to solve novel, language-conditioned tasks without collecting new demonstrations for each task. We start with a small set of demonstrations in the deployment environment, train a language-conditioned reward model on them, and then use that learned reward function to adapt the policy to unseen tasks, with no additional demonstrations needed.
Tell us about ReWiND: what are the highlights and contributions of this framework?
ReWiND is a simple and effective three-stage framework designed to adapt robot policies to new, language-conditioned tasks without collecting new demonstrations. Its highlights and contributions are:
- Reward function learning in the deployment environment
We first learn a reward function using only 5 demonstrations per task from the deployment environment.
- The reward model takes a sequence of images and a language instruction, and predicts per-frame progress from 0 to 1, giving us a dense reward signal rather than sparse success/failure.
- To expose the model to both successful and failed behaviors without having to collect failure demonstrations, we introduce a video rewind augmentation: for a video segment V(1:t), we pick an intermediate point t1, reverse the section V(t1:t) to produce V(t:t1), and append it back to the original sequence. This yields a synthetic sequence that looks like "making progress, then undoing that progress", effectively mimicking failed attempts (see the sketch after this list).
- This allows the reward model to learn a smoother and more accurate dense reward signal, improving generalization and stability during policy learning.
- Policy pre-training with offline RL
Once we have the learned reward function, we use it to relabel the small demonstration dataset with dense progress rewards. We then train a policy offline on these relabeled trajectories.
- Policy fine-tuning in the deployment environment
Finally, we adapt the pre-trained policy to new, unseen tasks in the deployment environment. We freeze the reward function and use it as the feedback signal for online reinforcement learning. After each episode, the newly collected trajectory is relabeled with dense rewards from the reward model and added to the replay buffer (sketched below). This iterative loop allows the policy to continuously improve and adapt to new tasks without requiring any additional demonstrations.
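To make the rewind augmentation concrete, here is a minimal NumPy sketch. The function names, array shapes, and the linear progress labels are our own illustration, not the authors' code:

```python
import numpy as np

def rewind_augment(frames: np.ndarray, t1: int) -> np.ndarray:
    """Build a synthetic 'progress, then regress' clip from a success video.

    frames: (T, H, W, C) array holding a successful trajectory V(1:T).
    t1:     intermediate index where the rewind starts, 0 < t1 < T.
    """
    rewound_tail = frames[t1:][::-1]  # V(t1:T) played backwards
    return np.concatenate([frames, rewound_tail], axis=0)

def rewind_labels(T: int, t1: int) -> np.ndarray:
    """Progress targets: ramp up to 1 over the original clip, then fall
    back down as the rewound frames undo the task."""
    forward = np.linspace(0.0, 1.0, T)
    return np.concatenate([forward, forward[t1:][::-1]])
```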
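The fine-tuning stage can likewise be summarized in a few lines. Everything here (`policy.rollout`, `reward_model`, `replay_buffer`) is an assumed interface for illustration, not the authors' API:

```python
def finetune_online(policy, env, reward_model, replay_buffer,
                    instruction, num_episodes, batch_size):
    """Stage three: collect a rollout, relabel it with the frozen reward
    model, store it, and update the policy from the replay buffer."""
    for _ in range(num_episodes):
        frames, actions = policy.rollout(env, instruction)  # one episode
        # Dense relabeling: one predicted progress value in [0, 1] per frame.
        rewards = reward_model(frames, instruction)
        replay_buffer.add(frames, actions, rewards)
        policy.update(replay_buffer.sample(batch_size))     # off-policy RL step
```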
Could you talk about the experiments you carried out to evaluate the framework?
We evaluate ReWiND in both the MetaWorld simulation environment and on a real-world Koch setup. Our evaluation focuses on two aspects: the generalization ability of the reward model and the performance of policy learning. We also compare how well different policies adapt to new tasks under our framework, demonstrating significant improvements over state-of-the-art approaches.
(Q1) Reward generalization: MetaWorld evaluation
We collect a MetaWorld dataset of 20 training tasks, with 5 demonstrations per task, and hold out 17 related but unseen tasks for evaluation. We train the reward function on the MetaWorld dataset together with a subset of the OpenX dataset.
We compare ReWiND to LIV [1], LIV-FT, RoboCLIP [2], VLC [3], and GVL [4]. For generalization to unseen tasks, we use video-language confusion matrices: we feed the reward model video sequences paired with different language instructions and expect the correctly matched video-instruction pairs to receive the highest predicted rewards. In the confusion matrix, this corresponds to the diagonal entries having the strongest (darkest) values, indicating that the reward function reliably identifies the correct task description even for unseen tasks (a minimal version is sketched below the figure).
Video-language reward confusion matrix. See the paper for more details.
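As a sketch of how such a matrix can be computed, assuming a `reward_model(video, instruction)` interface that returns per-frame progress (our illustration, not the authors' code):

```python
import numpy as np

def reward_confusion_matrix(videos, instructions, reward_model):
    """Score every video against every instruction.

    Entry (i, j) holds the final predicted progress for video i paired
    with instruction j; a well-grounded reward model concentrates its
    largest values on the diagonal (the matched pairs).
    """
    M = np.zeros((len(videos), len(instructions)))
    for i, video in enumerate(videos):
        for j, instruction in enumerate(instructions):
            M[i, j] = reward_model(video, instruction)[-1]  # last-frame progress
    return M
```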
For demo alignment, we measure the correlation between the reward model's predicted progress and the true timestep in successful trajectories, using Pearson r and Spearman ρ. For policy rollout ranking, we evaluate whether the reward function correctly ranks failed, near-success, and successful rollouts. Across these metrics, ReWiND significantly outperforms all baselines: for example, it achieves 30% higher Pearson correlation and 27% higher Spearman correlation than VLC on demo alignment, and delivers roughly a 74% relative improvement in reward separation between success categories compared to the strongest baseline, LIV-FT.
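The alignment metrics are standard correlations against time. A minimal version with SciPy, assuming a successful demo's ideal progress is simply the normalized frame index:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def alignment_scores(predicted_progress: np.ndarray) -> tuple[float, float]:
    """Correlate per-frame predicted progress with the true timestep.

    In a successful demo, progress should rise monotonically, so the
    reference signal is the normalized frame index running from 0 to 1.
    """
    T = len(predicted_progress)
    true_time = np.arange(T) / (T - 1)
    r, _ = pearsonr(predicted_progress, true_time)
    rho, _ = spearmanr(predicted_progress, true_time)
    return r, rho
```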
(Q2) Policy learning in simulation (MetaWorld)
We pre-train on the same 20 tasks and then evaluate RL on 8 unseen MetaWorld tasks for 100k environment steps.
Using ReWiND rewards, the policy achieves an interquartile mean (IQM) success rate of roughly 79%, a ~97.5% improvement over the best baseline. It also shows notably better sample efficiency, reaching high success rates much earlier in training.
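For readers unfamiliar with IQM: it averages only the middle 50% of runs, making the aggregate robust to outlier seeds. A small sketch, using one common trimming convention (ours, for illustration):

```python
import numpy as np

def interquartile_mean(scores) -> float:
    """Sort the per-run scores, drop the bottom and top quarters, and
    average the remaining middle 50%."""
    x = np.sort(np.asarray(scores, dtype=float))
    n = len(x)
    lo, hi = int(np.floor(0.25 * n)), int(np.ceil(0.75 * n))
    return float(x[lo:hi].mean())

# e.g. interquartile_mean([0.2, 0.7, 0.8, 0.85, 0.9, 1.0]) -> 0.8125
```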
(Q3) Policy learning on a real robot (Koch bimanual arms)
Setup: a real-world tabletop bimanual Koch v1.1 system with 5 tasks, including in-distribution, visually cluttered, and spatial-language generalization tasks.
We use 5 demonstrations for the reward model and 10 demonstrations for the policy in this more challenging setting. With about 1 hour of real-world RL (~50k environment steps), ReWiND improves average success from 12% → 68% (≈5× improvement), while VLC only goes from 8% → 10%.
Are you planning future work to further improve the ReWiND framework?
Yes, we plan to extend ReWiND to larger models and further improve the accuracy and generalization of the reward function across a broader range of tasks. In fact, we already have a workshop paper extending ReWiND to larger-scale models.
In addition, we plan to make the reward model capable of directly predicting success or failure, without relying on the environment's success signal during policy fine-tuning. Currently, even though ReWiND provides dense rewards, we still rely on the environment to indicate whether an episode has succeeded. Our goal is to develop a fully generalizable reward model that can provide both accurate dense rewards and reliable success detection on its own.
References
[1] Yecheng Jason Ma et al. "LIV: Language-image representations and rewards for robotic control." International Conference on Machine Learning, PMLR, 2023.
[2] Sumedh Sontakke et al. "RoboCLIP: One demonstration is enough to learn robot policies." Advances in Neural Information Processing Systems 36 (2023): 55681-55693.
[3] Minttu Alakuijala et al. "Video-language critic: Transferable reward functions for language-conditioned robotics." arXiv:2405.19988 (2024).
[4] Yecheng Jason Ma et al. "Vision language models are in-context value learners." The Thirteenth International Conference on Learning Representations, 2025.
About the authors
Jiahui Zhang is a Ph.D. student in Computer Science at the University of Texas at Dallas, advised by Prof. Yu Xiang. He received his M.S. degree from the University of Southern California, where he worked with Prof. Joseph Lim and Prof. Erdem Bıyık.

Jesse Zhang is a postdoctoral researcher at the University of Washington, advised by Prof. Dieter Fox and Prof. Abhishek Gupta. He completed his Ph.D. at the University of Southern California, advised by Prof. Jesse Thomason and Prof. Erdem Bıyık at USC, and Prof. Joseph J. Lim at KAIST.