Dense Scene Reconstructions
Photogrammetric scans of each recording environment, delivered as point clouds and aligned to ego-camera coordinates for direct reprojection.
Every session is captured simultaneously from two synchronized head-mounted cameras, paired with two exocentric views as well as dense scene scans and high-resolution object scans — enabling a new class of multi-view egocentric tasks.
CoMind is a large-scale egocentric dataset designed for studying human–scene–object interaction from paired first-person viewpoints. Unlike prior datasets with a single wearable camera, every recording in CoMind captures the same moment from two synchronized ego cameras — opening up new benchmarks for cross-view consistency, 3D reasoning, and collaborative activity understanding.
Every session is recorded by two participants wearing synchronized capture glasses. Both streams are frame-aligned and hardware-timestamped, and each session comes with shared scene and object scans.
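As a rough illustration of what frame alignment means in practice, the sketch below checks that frame i of one ego stream and frame i of the other carry hardware timestamps within a tolerance of each other. The timestamp unit (integer nanoseconds), the function names, and the tolerance are assumptions made for this example, not part of the dataset's documented format.

```python
from typing import Sequence


def max_sync_offset_ns(ts_a: Sequence[int], ts_b: Sequence[int]) -> int:
    """Largest absolute timestamp gap between same-index frames of two streams.

    Assumes (hypothetically) that timestamps are integer nanoseconds and that
    frame i of stream A is meant to correspond to frame i of stream B.
    """
    if len(ts_a) != len(ts_b):
        raise ValueError("frame-aligned streams should have the same frame count")
    return max((abs(a - b) for a, b in zip(ts_a, ts_b)), default=0)


def is_frame_aligned(ts_a: Sequence[int], ts_b: Sequence[int], tolerance_ns: int) -> bool:
    """True if every same-index frame pair differs by at most tolerance_ns."""
    return max_sync_offset_ns(ts_a, ts_b) <= tolerance_ns


# Illustrative values only: two frames per stream, roughly 33 ms apart.
left_ts = [0, 33_000_000]
right_ts = [100, 33_000_050]
aligned = is_frame_aligned(left_ts, right_ts, tolerance_ns=1_000)
```

A check like this is only a sanity test; the appropriate tolerance depends on the capture hardware, which is not specified here.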
Beyond video, every session ships with the 3D context it was captured in.
Every interacted object is independently scanned at sub-millimeter resolution, enabling 6-DoF pose supervision and rendering-based evaluation.
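Because the scene scans are described as point clouds aligned to ego-camera coordinates, reprojecting them into a frame reduces to a camera projection. The sketch below assumes an undistorted pinhole model with the +z axis pointing forward and hypothetical intrinsics (fx, fy, cx, cy). The actual camera model and calibration format of the capture glasses are not stated here, and a wide-angle lens would need a distortion model on top of this.

```python
from typing import Iterable, List, Optional, Tuple

Point3 = Tuple[float, float, float]  # (x, y, z) in the camera frame, assumed metres
Pixel = Tuple[float, float]          # (u, v) in image coordinates


def project_point(p: Point3, fx: float, fy: float, cx: float, cy: float) -> Optional[Pixel]:
    """Project one camera-frame point with a pinhole model.

    Returns None for points at or behind the image plane (z <= 0).
    """
    x, y, z = p
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def project_cloud(
    points: Iterable[Point3],
    fx: float, fy: float, cx: float, cy: float,
    width: int, height: int,
) -> List[Pixel]:
    """Project a point cloud and keep only the pixels that fall inside the image."""
    pixels: List[Pixel] = []
    for p in points:
        uv = project_point(p, fx, fy, cx, cy)
        if uv is None:
            continue
        u, v = uv
        if 0 <= u < width and 0 <= v < height:
            pixels.append(uv)
    return pixels


# Illustrative intrinsics and points, not taken from the dataset.
cloud = [(0.0, 0.0, 2.0), (10.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
visible = project_cloud(cloud, fx=500.0, fy=500.0, cx=320.0, cy=240.0, width=640, height=480)
```

In this example only the first point lands inside the 640x480 image: the second projects far outside the frame and the third lies behind the camera.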
The tasks are built from realistic social scenarios and require complex reasoning about social cues in the provided speech transcripts and context frames.
Given two synchronized ego views at a moment of joint attention, the model predicts the category of the jointly attended object, its bounding box in the left and right views, and the type of social cue that can be used to infer which object is being attended to.
From context frames and transcribed audio together with a prediction frame, the model predicts the noun and verb of the action to be performed, the interacted object's bounding box, and the type of social cue that can be used to infer the next interacted object.
From context frames and transcribed audio together with a prediction frame showing a moment directly preceding a handover event, the model predicts the delivery flow (who hands to whom), the object category and bounding box in the view of the handing participant, the initiator, and the cue type.
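The three task descriptions above each list what the model has to predict. One way to make those outputs concrete is as typed records, sketched below. The class names, field names, bounding-box convention, and string labels are all hypothetical and chosen for illustration; they do not reflect an official annotation or submission format.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed convention: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[float, float, float, float]


@dataclass
class JointAttentionPrediction:
    """Outputs for the joint-attention task (two synchronized ego views)."""
    object_category: str
    bbox_left: BBox    # box in the left view
    bbox_right: BBox   # box in the right view
    cue_type: str      # social cue indicating the attended object


@dataclass
class ActionAnticipationPrediction:
    """Outputs for anticipating the next action from context frames and transcript."""
    verb: str
    noun: str
    object_bbox: BBox  # box of the object to be interacted with
    cue_type: str      # social cue indicating the next interacted object


@dataclass
class HandoverPrediction:
    """Outputs for the frame directly preceding a handover event."""
    giver: str             # delivery flow: who hands the object over
    receiver: str          # delivery flow: who receives it
    object_category: str
    bbox_giver_view: BBox  # box in the view of the handing participant
    initiator: str         # who initiates the handover
    cue_type: str


# Placeholder values for illustration only.
example = HandoverPrediction(
    giver="participant_a",
    receiver="participant_b",
    object_category="mug",
    bbox_giver_view=(120.0, 80.0, 220.0, 190.0),
    initiator="participant_b",
    cue_type="verbal_request",
)
```

Keeping the giver and the initiator as separate fields mirrors the task description, where the participant who starts the handover is not necessarily the one holding the object.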
Coming soon. Please stay tuned!
@article{comind2026,
title = {{CoMind: Understanding Collaborative Human Activity from Multiple Minds and Views}},
author = {Gavryushin, Alexey and Zhang, Dingxi and Huang, Zhao and Delitzas, Alexandros and Chen, Jiaqi and Ellis, Ben and Z{\"o}llner, Cedric and Patel, Manthan and Kaufmann, Manuel and Pollefeys, Marc and Wang, Xi},
year = {2026}
}