CoMind Dataset: Understanding Collaborative Human Activity from Multiple Minds and Views

01 — Video

Overview Video

A short walkthrough of CoMind: dual-view egocentric capture, scene and object scans, and the three benchmark tasks.

02 — Signature Feature

Two Egocentric Views, One Moment

Every session is recorded by two participants wearing synchronized capture glasses. Both streams are frame-aligned, hardware-timestamped, and come with shared scene & object scans.

Ego View — Subject A

Ego View — Subject B


03 — 3D Assets

Scene & Object Scans

Beyond video, every session ships with the 3D context it was captured in.

Scene Scan

Dense Scene Reconstructions

Photogrammetric scans of each recording environment, delivered as point clouds and aligned to ego-camera coordinates for direct reprojection.

Colored .ply
Camera extrinsics
55 unique environments
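
As a rough illustration of what "direct reprojection" can look like, the sketch below projects one scene point into an ego view. It rests on assumptions the page does not confirm: that the extrinsics are a 4x4 world-to-camera matrix, and that a simple pinhole model with intrinsics fx, fy, cx, cy applies (the capture glasses may use a different lens model). The function name and parameters are hypothetical, and the sketch is kept dependency-free on purpose.

```python
from typing import Optional, Sequence, Tuple


def project_point(
    point_world: Sequence[float],
    world_to_cam: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> Optional[Tuple[float, float]]:
    """Project a 3D scene point into pixel coordinates.

    Assumes `world_to_cam` is a 4x4 rigid transform (world -> camera) and a
    pinhole camera; both are illustrative assumptions, not the dataset spec.
    Returns None for points at or behind the camera plane.
    """
    x, y, z = point_world
    # Apply the top three rows of the homogeneous transform: R @ p + t.
    xc, yc, zc = (
        row[0] * x + row[1] * y + row[2] * z + row[3] for row in world_to_cam[:3]
    )
    if zc <= 0:
        return None
    # Perspective division followed by the intrinsic scaling and offset.
    return (fx * xc / zc + cx, fy * yc / zc + cy)


# Usage with an identity pose, i.e. the camera sits at the world origin.
IDENTITY = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
pixel = project_point((1.0, 2.0, 4.0), IDENTITY, fx=100.0, fy=100.0, cx=50.0, cy=60.0)
```

If the released extrinsics turn out to be camera-to-world instead, the matrix would need to be inverted before this step.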

Object Scan

High-Resolution Object Scans

Interacted objects are scanned independently at sub-millimeter resolution, which supports 6-DoF pose extraction and rendering-based evaluation.

Scale-accurate meshes
Objects reused across recordings

04 — Benchmarks

Three Tasks Testing Social Reasoning

The tasks are built from realistic social scenarios and require reasoning about the social cues present in the provided speech transcripts and context frames.

T1

Joint Attention Estimation

Given two synchronized ego views at a moment of joint attention, the model predicts the category of the jointly attended object, its bounding box in the left and right views, and the type of social cue that can be used to infer which object is being attended to.

Multi-view
Grounding in Two Views
Object Category Prediction
Social Cue Detection
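
To make the T1 output concrete, here is a hypothetical container for a single prediction. The class name, field names, and the (x_min, y_min, x_max, y_max) box layout are assumptions for illustration only; the actual submission format is defined by the task's submission form, not by this sketch.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class JointAttentionPrediction:
    """Hypothetical record of the four outputs described for T1."""

    object_category: str  # category of the jointly attended object
    left_box: Box         # bounding box in the left ego view
    right_box: Box        # bounding box in the right ego view
    cue_type: str         # type of social cue; label set not given on this page

    def is_well_formed(self) -> bool:
        """Check that both boxes have positive width and height."""
        return all(
            box[2] > box[0] and box[3] > box[1]
            for box in (self.left_box, self.right_box)
        )


example = JointAttentionPrediction(
    object_category="mug",          # placeholder value
    left_box=(10.0, 20.0, 110.0, 140.0),
    right_box=(300.0, 40.0, 380.0, 150.0),
    cue_type="placeholder-cue",     # placeholder value
)
```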

T2

Socially Conditioned Object Interaction Anticipation

From context frames and transcribed audio together with a prediction frame, the model predicts the noun and verb of the action to be performed, the interacted object's bounding box, and the type of social cue that can be used to infer the next interacted object.

Single-view
Grounding
Object Category Prediction
Action Prediction
Social Cue Detection

T3

Collaborative Handover Prediction

From context frames and transcribed audio together with a prediction frame showing a moment directly preceding a handover event, the model predicts the delivery flow (who hands to whom), the object category and bounding box in the view of the handing participant, the initiator, and the cue type.

Multi-view
Grounding
Object Category Prediction
Social Cue Detection
Handover Initiator Detection
Handover Flow Detection

05 — Leaderboards

Benchmark Leaderboards

Evaluate your model on the CoMind test set. Submit your test-set predictions for a task through its form below — we score them against the held-out annotations and add your entry here. All metrics are fractions in [0, 1]; higher is better (↑).
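
The IoU@0.5 columns are not defined further on this page. The sketch below assumes the common convention: a predicted box counts as correct when its intersection-over-union with the ground-truth box is at least 0.5, and the reported value is the fraction of correct samples. The box layout (x_min, y_min, x_max, y_max) and the function names are illustrative and are not the official scoring code.

```python
from typing import Sequence, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_at_threshold(
    preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of samples whose predicted box reaches the IoU threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not gts:
        return 0.0
    hits = sum(box_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Usage: one exact match and one box shifted by half its width.
score = iou_at_threshold(
    preds=[(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)],
    gts=[(0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 3.0, 2.0)],
)
```

Under these assumptions the result always lies in [0, 1], in line with the statement above that all metrics are fractions.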

T1

Joint Attention Estimation

Submit predictions ↗

#	Method	L-IoU@0.5 ↑	R-IoU@0.5 ↑	Cue Type ↑	Cat. (L1) ↑
No submissions yet. Be the first to appear here.

T2

Socially Conditioned Object Interaction Anticipation

Submit predictions ↗

#	Method	IoU@0.5 ↑	Cue Type ↑	Act. Verb ↑	Act. Noun (L1) ↑
No submissions yet. Be the first to appear here.

T3

Collaborative Handover Prediction

Submit predictions ↗

#	Method	IoU@0.5 ↑	Del. Flow ↑	Initiator ↑	Init. Type ↑	Cat. (L1) ↑	TTH ↑
No submissions yet. Be the first to appear here.

06 — Access

Download

CoMind is distributed through a Python download script. Run it with the --help flag for usage, all options, and the full list of dataset components.

Download script

07 — BibTeX

Citation

@inproceedings{gavryushin2026eccv,
  author    = {Gavryushin, Alexey and Zhang, Dingxi and Huang, Zhao and Delitzas, Alexandros and Chen, Jiaqi and Ellis, Ben and Z{\"o}llner, Cedric and Patel, Manthan and Kaufmann, Manuel and Pollefeys, Marc and Wang, Xi},
  title     = {{CoMind}: Understanding Collaborative Human Activity from Multiple Minds and Views},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2026},
  publisher = {Springer}
}