Seeing Touch from Motion

A Unified Modality-Aware Visuo-Tactile Policy with Tactile Motion Correlation

ECCV 2026

Shengqi Xu1,2,3, Guojin Zhong1,2, Yang Liu1,2, Fanjie Wang1,2, Hu Luo1,2, Hanyu Zhou4, Weiyao Zhang1,2, Ziyi Ye1,2, Zuxuan Wu1,2,3,*, Yu-Gang Jiang1,2,*
1Institute of Trustworthy Embodied AI, Fudan University   2Shanghai Key Laboratory of Multimodal Embodied AI   3NeoteAI   4School of Computing, National University of Singapore
*Corresponding author
Tactile Motion Correlation overview

Tactile Motion Correlation (TMC). Fine-grained contact states such as making contact and releasing contact look nearly identical in raw tactile images and even in cumulative motion. By correlating transient and cumulative motion (same vs. opposite direction), TMC explicitly resolves this ambiguity, providing highly discriminative cues for contact-rich manipulation.

Abstract

Visuo-tactile policies leveraging optical tactile sensors have shown great promise in contact-rich manipulation. These sensors achieve high spatial resolution and multi-dimensional force sensing by using an internal camera to monitor the deformation of their elastic gel surface, thereby indirectly inferring tactile cues. Despite their advantages, extracting the fine-grained contact states necessary for contact-rich manipulation remains an open challenge. Existing methods typically use either raw images or cumulative motion fields, both of which are prone to perception ambiguity: raw images mainly capture appearance changes, while cumulative motion only reflects aggregate gel deformation. As a result, distinct contact states can exhibit highly similar patterns.

To address this, we explore the dynamic priors of tactile motion and discover that the correlation between transient and cumulative motion can explicitly distinguish fine-grained contact states. Based on this insight, we propose a motion-aware tactile representation, Tactile Motion Correlation (TMC). Beyond representation, effective fusion of tactile and visual modalities is also critical. We take advantage of the Mixture-of-Transformers (MoT) architecture and propose ViTacMotor, a unified yet modality-aware visuo-tactile policy that captures cross-modal complementarity while preserving modality-specific properties. Extensive experiments on four challenging contact-rich manipulation tasks demonstrate the superior performance of our method.

Method

ViTacMotor method overview

Overview of ViTacMotor. (a) TMC models the correlation between transient and cumulative tactile motion through their dot product to explicitly distinguish fine-grained contact states. (b) A unified yet modality-aware visuo-tactile fusion framework built on the Mixture-of-Transformers architecture captures cross-modal complementarity while preserving the unique properties of each modality.
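The fusion idea in (b) can be illustrated with a minimal sketch of a single Mixture-of-Transformers layer. It assumes the common MoT pattern: each modality keeps its own projection and feed-forward weights, while self-attention runs over the full token sequence of all modalities. This is not the ViTacMotor implementation. The function names, the single attention head, the omitted layer normalization, and the random weights are all simplifying assumptions made for illustration.

```python
import math
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]
Params = Dict[str, Matrix]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply a square matrix by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(dim: int, rng: random.Random) -> Params:
    """Random weights for ONE modality (hypothetical initialisation)."""
    names = ("q", "k", "v", "o", "ffn")
    return {n: [[rng.gauss(0.0, 0.2) for _ in range(dim)] for _ in range(dim)]
            for n in names}


def mot_layer(tokens: List[Vector], modalities: List[str],
              params: Dict[str, Params]) -> List[Vector]:
    """One simplified MoT layer (single head, no layer norm)."""
    # 1) Modality-specific projections: each token uses its own modality's weights.
    q = [matvec(params[m]["q"], x) for x, m in zip(tokens, modalities)]
    k = [matvec(params[m]["k"], x) for x, m in zip(tokens, modalities)]
    v = [matvec(params[m]["v"], x) for x, m in zip(tokens, modalities)]
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    out = []
    for i, (x, m) in enumerate(zip(tokens, modalities)):
        # 2) Global attention: token i attends to tokens of every modality.
        scores = [sum(a * b for a, b in zip(q[i], kj)) / scale for kj in k]
        weights = softmax(scores)
        ctx = [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(dim)]
        # 3) Modality-specific output projection with a residual connection.
        attn = matvec(params[m]["o"], ctx)
        h = [xi + ai for xi, ai in zip(x, attn)]
        # 4) Modality-specific feed-forward (one linear layer + ReLU) with residual.
        ffn = [max(0.0, z) for z in matvec(params[m]["ffn"], h)]
        out.append([hi + fi for hi, fi in zip(h, ffn)])
    return out


# Usage: two visual tokens and one tactile token share one attention pass.
rng = random.Random(0)
params = {"vision": init_params(4, rng), "tactile": init_params(4, rng)}
tokens = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.8, -0.3, 0.1], [0.0, 1.0, 0.0, 1.0]]
modalities = ["vision", "vision", "tactile"]
fused = mot_layer(tokens, modalities, params)
```

Step 2 is where cross-modal complementarity enters, since a visual token can read from the tactile token. Steps 1, 3, and 4 keep the processing modality-specific.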

Why Tactile Motion Correlation Works

Analysis of tactile motion correlation

Analysis of TMC properties (Daimon sensor). (a) Raw images and cumulative motion are ambiguous across contact states, while the transient–cumulative correlation is highly discriminative. (b) The dot product cleanly separates contact states in a 3D distribution. (c) The dot-product magnitude is positively correlated with contact force.

  • Making contact: transient and cumulative motion align → positive dot product.
  • Releasing contact: the two motions oppose → negative dot product.
  • Stable contact / no contact: transient motion vanishes → dot product near zero.
  • Sliding: clear positive–negative spatial separation reveals the sliding direction.
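
The four cases above can be sketched in a few lines of Python. The sketch rests on assumptions not stated on this page: markers are tracked as 2D points, cumulative motion is the displacement from a no-contact reference frame, and transient motion is the displacement from the previous frame. The function names and the thresholds `eps` and `mix_ratio` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def tmc_map(reference: Sequence[Point], previous: Sequence[Point],
            current: Sequence[Point]) -> List[float]:
    """Per-marker dot product of transient and cumulative motion."""
    values = []
    for (rx, ry), (px, py), (cx, cy) in zip(reference, previous, current):
        cumulative = (cx - rx, cy - ry)   # displacement since the reference frame
        transient = (cx - px, cy - py)    # displacement since the previous frame
        values.append(transient[0] * cumulative[0] + transient[1] * cumulative[1])
    return values


def contact_state(values: Sequence[float], eps: float = 1e-3,
                  mix_ratio: float = 0.25) -> str:
    """Coarse contact-state label from the sign pattern of the dot products.

    eps and mix_ratio are hypothetical thresholds, not values from the paper.
    """
    positive = sum(v for v in values if v > eps)
    negative = -sum(v for v in values if v < -eps)
    if positive == 0.0 and negative == 0.0:
        return "stable_or_no_contact"     # transient motion vanishes
    if min(positive, negative) >= mix_ratio * max(positive, negative):
        return "sliding"                  # both signs clearly present
    return "making_contact" if positive > negative else "releasing_contact"


# Usage: two markers pressed further in the same direction as before.
reference = [(0.0, 0.0), (1.0, 0.0)]
previous = [(0.1, 0.0), (1.1, 0.0)]
current = [(0.2, 0.0), (1.2, 0.0)]
state = contact_state(tmc_map(reference, previous, current))
```

A learned policy would consume the dense dot-product map directly. The rule-based labels here only make the sign logic concrete.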

Hardware Setup & Tasks

Real-world experimental setup

Real-world setup. (a) A 6-DoF Agilex arm with a 1-DoF gripper, two cameras (wrist & third-view), and two optical tactile sensors. (b) Two sensors, Daimon and Xense, with distinct markers and gels. (c) Four contact-rich tasks: tube collection, whiteboard erasing, lightbulb insertion, and pencil sharpening.

Video Comparisons

Side-by-side rollouts: Baseline vs. ViTacMotor (Ours) on each contact-rich task.

  • Task 1: Tube Collection
  • Task 2: Whiteboard Erasing
  • Task 3: Lightbulb Insertion
  • Task 4: Pencil Sharpening

Qualitative Results

Qualitative policy execution results

Policy execution. ViTacMotor with two tactile sensors (Daimon and Xense) on four contact-rich manipulation tasks: tube collection, whiteboard erasing, lightbulb insertion, and pencil sharpening.

Quantitative Comparison

Success rate (%) over 15 trials. Numbers in parentheses after each task name are the number of training demonstrations.

Tasks Requiring In-Hand State Information

                  Tube Collection (35)              | Lightbulb Insertion (60)
Method            Grasp  1st Hole  2nd Hole  Whole  | Align  Insert  Whole
ACT*              66.7   60.0      53.3      53.3   | 60.0   13.3    13.3
DP*               73.3   53.3      40.0      40.0   | 53.3   6.7     6.7
ACT + T           86.7   60.0      60.0      60.0   | 60.0   26.7    26.7
Policy Consensus  80.0   53.3      53.3      53.3   | 53.3   26.7    26.7
TactileACT        93.3   73.3      60.0      60.0   | 60.0   40.0    40.0
ViTacMotor        93.3   80.0      73.3      73.3   | 60.0   40.0    40.0

Tasks Requiring Fine-Grained Force Control

                  Whiteboard Erasing (50)             | Pencil Sharpening (40)
Method            Grasp  1st Erase  2nd Erase  Whole  | Sharpen  Holder  Whole
ACT*              100.0  60.0       46.7       46.7   | 40.0     33.3    33.3
DP*               100.0  66.7       40.0       40.0   | 33.3     33.3    33.3
ACT + T           100.0  80.0       60.0       60.0   | 40.0     40.0    40.0
Policy Consensus  93.3   73.3       66.7       66.7   | 33.3     33.3    33.3
TactileACT        100.0  86.7       73.3       73.3   | 46.7     46.7    46.7
ViTacMotor        100.0  86.7       86.7       86.7   | 60.0     60.0    60.0
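
Because every entry is a success rate over 15 trials, one trial is worth about 6.7 percentage points. The short sketch below converts between rates and success counts, assuming each reported rate is a whole number of successes out of exactly 15 trials, rounded to one decimal. The helper names are hypothetical.

```python
def successes_from_rate(rate_percent: float, trials: int = 15) -> int:
    """Recover the number of successful trials from a rounded success rate."""
    return round(rate_percent * trials / 100)


def rate_from_successes(successes: int, trials: int = 15) -> float:
    """Success rate in percent, rounded to one decimal as in the tables."""
    return round(100 * successes / trials, 1)


# Whole-task whiteboard erasing: TactileACT 73.3 vs. ViTacMotor 86.7.
tactile_act = successes_from_rate(73.3)
vitacmotor = successes_from_rate(86.7)
gap_in_trials = vitacmotor - tactile_act
```

Under this assumption, the 13.4-point gap on whole-task whiteboard erasing corresponds to two additional successful trials out of 15.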

Ablation Study

Base  TMC  MoT  | Whiteboard Erasing  Tube Collection
                | 46.7                53.3
                | 73.3                66.7
                | 66.7                60.0
                | 86.7                73.3

TMC in Existing Policies

Tactile Representation  ACT   DP
Raw Image               60.0  53.3
Cumulative Motion       53.3  40.0
TMC (Ours)              73.3  66.7

Robustness

Robustness to environment and object variations. Under lighting that varies over time and is spatially uniform or non-uniform, and under object changes (different erasers and inks, a cylinder in place of the tube), ViTacMotor still completes the tasks, thanks to TMC's motion-based, appearance-agnostic design.

BibTeX

@inproceedings{vitacmotor2026,
  title     = {Seeing Touch from Motion: A Unified Modality-Aware Visuo-Tactile
               Policy with Tactile Motion Correlation},
  author    = {Xu, Shengqi and Zhong, Guojin and Liu, Yang and Wang, Fanjie and
               Luo, Hu and Zhou, Hanyu and Zhang, Weiyao and Ye, Ziyi and
               Wu, Zuxuan and Jiang, Yu-Gang},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2026}
}