Egocentric Vision · Benchmark · 2026

CoMind
A Dual-View Egocentric Dataset for Interaction Understanding

Every session is captured simultaneously from two synchronized head-mounted cameras, paired with two exocentric views as well as dense scene scans and high-resolution object scans — enabling a new class of multi-view egocentric tasks.

Example sessions (ego views A and B): Kitchen, Meal Prep · Workshop, Assembly · Office, Desk Org
01 — Overview

A new lens on first-person activity

CoMind is a large-scale egocentric dataset designed for studying human–scene–object interaction from paired first-person viewpoints. Unlike prior datasets with a single wearable camera, every recording in CoMind captures the same moment from two synchronized ego cameras — opening up new benchmarks for cross-view consistency, 3D reasoning, and collaborative activity understanding.

  • Dual Ego Views — two time-synchronized head-mounted cameras per session.
  • Scene Scans — dense 3D reconstructions of every recording environment.
  • Object Scans — high-resolution meshes for all interacted objects.
  • Rich Annotations — actions, hand/object states, gaze, and 6-DoF poses.
02 — Signature Feature

Two Egocentric Views, One Moment

Every session is recorded by two participants wearing synchronized capture glasses. Both streams are frame-aligned, hardware-timestamped, and come with shared scene & object scans.

Ego View — Subject A
Ego View — Subject B
03 — 3D Assets

Scene & Object Scans

Beyond video, every session ships with the 3D context it was captured in.

Scene Scan

Dense Scene Reconstructions

Photogrammetric scans of each recording environment, delivered as point clouds and aligned to ego-camera coordinates for direct reprojection.

  • Colored .ply
  • Camera extrinsics per frame
  • 53 unique environments
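As a rough illustration of what "direct reprojection" could look like, the sketch below projects scan points into one ego frame with a pinhole model. It is not the dataset's official tooling: the function name `project_points`, the world-to-camera convention of the 4x4 extrinsic matrix, the intrinsics matrix `K`, and the absence of lens distortion are all assumptions made for illustration. Points are taken as a NumPy array, so loading the `.ply` file is left to whichever reader you prefer.

```python
import numpy as np

def project_points(points_world, world_to_cam, K, image_size):
    """Project Nx3 scan points into the pixel grid of one ego frame.

    world_to_cam: 4x4 rigid transform (ASSUMED convention: scan/world -> camera).
    K: 3x3 pinhole intrinsics (ASSUMED; lens distortion is not modeled).
    image_size: (width, height) in pixels.
    Returns (pixels, mask): Mx2 pixel coordinates of the visible points and a
    length-N boolean mask marking which input points are visible.
    """
    points_world = np.asarray(points_world, dtype=float)
    ones = np.ones((len(points_world), 1))
    homogeneous = np.hstack([points_world, ones])          # N x 4
    cam = (world_to_cam @ homogeneous.T).T[:, :3]          # N x 3, camera frame

    # Only points strictly in front of the camera can be imaged.
    in_front = cam[:, 2] > 1e-6
    # Use a dummy depth of 1 for rejected points to avoid dividing by zero.
    depth = np.where(in_front, cam[:, 2], 1.0)
    uv = (K @ (cam / depth[:, None]).T).T[:, :2]           # perspective divide

    width, height = image_size
    inside = (
        (uv[:, 0] >= 0) & (uv[:, 0] < width)
        & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    )
    mask = in_front & inside
    return uv[mask], mask
```

If the released extrinsics turn out to be camera-to-world, the matrix would need to be inverted with `np.linalg.inv` before being passed in.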
Object Scan

High-Resolution Object Scans

Every interacted object is independently scanned at sub-millimeter resolution, enabling 6-DoF pose supervision and rendering-based evaluation.

  • Watertight meshes
  • Scale-accurate, oriented upright
  • 9 objects reused across recordings
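For readers unfamiliar with 6-DoF pose supervision, one common way to compare a predicted object pose against a reference pose is to report a rotation error (geodesic angle) and a translation error separately. The page does not state which metrics CoMind uses, so the sketch below is only an assumed, generic formulation; `pose_errors` and the 4x4 homogeneous pose layout are illustrative choices.

```python
import numpy as np

def pose_errors(T_pred, T_gt):
    """Compare two 6-DoF poses given as 4x4 homogeneous matrices.

    Returns (rotation_error_degrees, translation_error), where the
    translation error is in the same unit as the poses themselves.
    """
    T_pred = np.asarray(T_pred, dtype=float)
    T_gt = np.asarray(T_gt, dtype=float)
    R_pred, t_pred = T_pred[:3, :3], T_pred[:3, 3]
    R_gt, t_gt = T_gt[:3, :3], T_gt[:3, 3]

    # Relative rotation between prediction and reference.
    R_rel = R_pred.T @ R_gt
    # Geodesic angle from the trace; clip guards against rounding drift.
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    rotation_error_deg = float(np.degrees(np.arccos(cos_angle)))

    translation_error = float(np.linalg.norm(t_pred - t_gt))
    return rotation_error_deg, translation_error
```

Because the object scans are described as scale-accurate, a translation error computed this way would be in metric units, provided the poses are expressed in the same units as the scans.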
04 — Benchmarks

Three Tasks Testing Social Reasoning

The tasks are constructed from realistic social scenarios and require complex reasoning about social cues in the provided speech transcripts and context frames.

T1

Joint Attention Estimation

Given two synchronized ego views at a moment of joint attention, the model predicts the category of the jointly attended object, its bounding box in the left and right views, and the type of social cue that can be used to infer which object is being attended to.

Multi-view · Grounding · Object Category Prediction · Social Cue Detection
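Bounding-box grounding of this kind is often scored with intersection-over-union (IoU) between the predicted and reference boxes. The page does not specify the evaluation protocol for this task, so the following is only a generic sketch; the function name `box_iou` and the `(x_min, y_min, x_max, y_max)` box layout are assumptions.

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max) tuples (ASSUMED layout).
    Returns a float in [0, 1]; degenerate inputs yield 0.0.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0
```

For a two-view task, the same function could be applied once per ego view, with the per-view scores combined however the benchmark ultimately prescribes.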
T2

Socially Conditioned Object Interaction Anticipation

From context frames and transcribed audio together with a prediction frame, the model predicts the noun and verb of the action to be performed, the interacted object's bounding box, and the type of social cue that can be used to infer the next interacted object.

Single-view · Grounding · Social Cue Detection
T3

Collaborative Handover Prediction

From context frames and transcribed audio together with a prediction frame showing a moment directly preceding a handover event, the model predicts the delivery flow (who hands to whom), the object category and bounding box in the view of the handing participant, the initiator, and the cue type.

Multi-view · Grounding · Social Cue Detection
05 — Access

Download

Coming soon. Please stay tuned!

06 — BibTeX

Citation

@article{comind2026,
  title     = {{CoMind: Understanding Collaborative Human Activity from Multiple Minds and Views}},
  author    = {Gavryushin, Alexey and Zhang, Dingxi and Huang, Zhao and Delitzas, Alexandros and Chen, Jiaqi and Ellis, Ben and Z{\"o}llner, Cedric and Patel, Manthan and Kaufmann, Manuel and Pollefeys, Marc and Wang, Xi},
  year      = {2026}
}