Understanding skilled human activity from first-person (egocentric) video through large-scale multimodal datasets and foundation models.
Automatic segmentation of generic objects in videos by combining motion and appearance cues using deep learning.
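One common way to combine the two cues is late fusion: an appearance model produces a per-pixel object likelihood, a motion estimate (e.g. optical flow) produces a moving-pixel score, and the two are blended before thresholding. The sketch below is a minimal, illustrative version of that idea; the function name, the convex-combination weighting, and the toy inputs are assumptions, not the specific method referred to above.

```python
import numpy as np

def fuse_motion_appearance(appearance_prob, flow, alpha=0.5, threshold=0.5):
    """Fuse an appearance objectness map with an optical-flow motion cue.

    appearance_prob: HxW array in [0, 1], per-pixel object likelihood
                     (e.g. the output of an appearance network).
    flow:            HxWx2 array of optical-flow vectors.
    alpha:           weight on the appearance cue (illustrative default).
    """
    # Motion cue: normalized flow magnitude, so moving pixels score higher.
    magnitude = np.linalg.norm(flow, axis=-1)
    motion_prob = magnitude / (magnitude.max() + 1e-8)
    # Late fusion: convex combination of the two cues, then threshold.
    fused = alpha * appearance_prob + (1.0 - alpha) * motion_prob
    return fused > threshold

# Toy example: a moving square supported by appearance evidence.
appearance = np.zeros((8, 8))
appearance[2:6, 2:6] = 0.9
flow = np.zeros((8, 8, 2))
flow[2:6, 2:6] = [3.0, 0.0]   # the square moves right
mask = fuse_motion_appearance(appearance, flow)
```

In practice the fusion is usually learned end-to-end (e.g. a two-stream network) rather than a fixed weighted average, but the structure — separate appearance and motion branches whose evidence is merged into one mask — is the same.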
Interactive systems that leverage human input through clicks and annotations for efficient segmentation.
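A standard way to feed user clicks into a segmentation network is to encode them as an extra guidance channel, such as a truncated Euclidean distance map from the clicked pixels. The sketch below shows only that encoding step; the function name and the truncation value are illustrative assumptions, not a specific system's API.

```python
import numpy as np

def click_distance_map(clicks, shape, truncate=20.0):
    """Encode user clicks as a truncated Euclidean distance map.

    clicks:   list of (row, col) pixel positions the user clicked.
    shape:    (H, W) of the image.
    truncate: cap on distances, so pixels far from any click share
              one background value (value chosen for illustration).
    """
    rows, cols = np.indices(shape)
    dist = np.full(shape, np.inf)
    # Distance from each pixel to its nearest click.
    for r, c in clicks:
        d = np.sqrt((rows - r) ** 2 + (cols - c) ** 2)
        dist = np.minimum(dist, d)
    return np.minimum(dist, truncate)

# One positive click at the image center; the map is 0 at the click
# and grows with distance, giving the network spatial guidance.
guidance = click_distance_map([(4, 4)], (9, 9))
```

Such a map (often one channel for positive clicks and one for negative clicks) is concatenated with the RGB image as network input, letting a single model refine its mask as the user adds annotations.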
Building and curating multimodal datasets for training foundation models in computer vision and embodied AI.