# Documentation

## ScanNet++ Toolbox

Check out our ScanNet++ Toolbox on GitHub. It provides tools and code to:

- Read the dataset structure
- Decode the iPhone RGB, depth, and mask videos
- Undistort the DSLR fisheye images into pinhole images
- Render high-resolution depth maps from the mesh for the DSLR and iPhone frames
- Nerfstudio dataparser for DSLR images is ready (see this PR)
- Prepare training data for the semantic tasks
- Official evaluation code for the benchmark

## 3D Gaussian Splatting on ScanNet++

Check out our 3DGS example on GitHub. It provides tools and code to visualize the frames and train 3D Gaussian Splatting on the ScanNet++ scenes. It also includes scripts to render images for submission to the NVS benchmark (DSLR).

## Data Structure

The ScanNet++ dataset currently consists of 1006 scenes. The default download (low-resolution DSLR images, iPhone data, 3D meshes, and semantics) occupies about 1.5 TB on disk.

| Asset group | Download size |
| --- | --- |
| DSLR (2 MP), iPhone, meshes and semantics (default download) | 1.5 TB |
| DSLR (2 MP) | 371 GB |
| DSLR 2 MP + 33 MP (hi-res) | 9 TB |
| Meshes and semantics | 132 GB |
| Point clouds | 720 GB |
| Panocam | 319 GB |

The download contains one folder per scene with the laser scan, DSLR, and iPhone data, plus several metadata files. The data is organized as follows:

- `split/`
  - `nvs_sem_train.txt`: training set for the NVS and semantic tasks, 230 scenes
  - `nvs_sem_val.txt`: validation set for the NVS and semantic tasks, 50 scenes
  - `nvs_test.txt`: test set for NVS, 50 scenes
  - `nvs_test_small.txt`: smaller test set for NVS, 12 scenes; a subset of `nvs_test.txt`
  - `nvs_test_iphone.txt`: test set for NVS with iPhone data, 12 scenes
  - `sem_test.txt`: test set for the semantic tasks, 50 scenes
  - Each file lists the scene IDs of the respective split.
- `metadata/`
  - `scene_types.json`: scene ID to scene type mapping for all scenes
  - `semantic_classes.txt`: list of semantic classes
  - `instance_classes.txt`: subset of the semantic classes that have instances (i.e., excludes wall, ceiling, floor, ...)
  - `semantic_benchmark/`
    - `top100.txt`: top 100 semantic classes for the semantic segmentation benchmark
    - `top100_instance.txt`: subset of the top 100 classes used for the instance segmentation benchmark
    - `map_benchmark.csv`: mapping from raw semantic labels to benchmark labels
- `data/<scene_id>/`
  - `scans/`
    - `pc_aligned.ply`: point cloud from the laser scanner, axis-aligned
    - `pc_aligned_mask.txt`: indices of anonymized points
    - `scanner_poses.json`: scanner positions, one 4x4 transformation matrix per position
    - `mesh_aligned_0.05.ply`: mesh decimated to 5% of its original size, obtained from the point cloud
    - `mesh_aligned_0.05_mask.txt`: indices of mesh vertices with anonymization applied
    - `mesh_aligned_0.05_semantic.ply`: the vertex `label` property contains the integer semantic label, indexing into the classes in `semantic_classes.txt`; unlabeled vertices have the label -100
    - `segments.json`: `json_data["segIndices"]` contains the segment ID of each vertex
    - `segments_anno.json`: `json_data[i]` corresponds to a single annotated object
      - `label`: the semantic label of this object
      - `segments`: all the segments belonging to this object
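As a quick illustration of the per-vertex labels described above, here is a minimal sketch for reading them. It assumes the `numpy` and `plyfile` packages are installed; `SCENE_ID` is a placeholder, and the paths simply follow the layout on this page.

```python
import numpy as np
from plyfile import PlyData

scene_dir = "data/SCENE_ID"  # SCENE_ID is a placeholder, not a real scene ID
classes = open("metadata/semantic_classes.txt").read().splitlines()

ply = PlyData.read(f"{scene_dir}/scans/mesh_aligned_0.05_semantic.ply")
labels = np.asarray(ply["vertex"]["label"])  # one integer label per vertex
valid = labels >= 0                          # unlabeled vertices have label -100

print(f"{valid.sum()} / {len(labels)} vertices are labeled")
print("class of first labeled vertex:", classes[labels[valid][0]])
```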
  - `dslr/`
    - `resized_images/`: fisheye DSLR images, resized, JPG
    - `resized_anon_masks/`: PNG masks specifying which pixels have been anonymized (0: invalid, 255: valid)
    - `original_images/`: full-resolution images, JPG
    - `original_anon_masks/`: PNG, similar to the resized masks
    - `resized_undistorted_images/`: undistorted DSLR images with the same resolution as the resized images, JPG
    - `resized_undistorted_masks/`: PNG, similar to the resized masks
    - `colmap/`: the COLMAP camera model, aligned with the 3D scans, which means the poses are in metric scale. Make sure to use this model if you want to do 2D-3D matching against the provided mesh.
      - `cameras.txt`: camera type (OPENCV_FISHEYE) and intrinsic parameters (fx, fy, cx, cy, distortion parameters)
      - `images.txt`: extrinsics of each image: qvec (quaternion) and tvec
      - `points3D.txt`: 3D feature points used by COLMAP
      - Useful references: the COLMAP docs for the camera model, more details in the COLMAP source code, the Python reader provided by COLMAP, and the Python (Open3D) visualizer provided by COLMAP
    - `nerfstudio/`
      - `transforms.json`: the same camera poses in the format used by Nerfstudio, i.e., the OpenGL/Blender convention. Note that this coordinate system differs from the OpenCV/COLMAP convention.
        - Poses: `frames` and `test_frames` contain the poses of the train and test images, respectively. Each frame has
          - `mask`: filename of the binary mask file
          - `is_bad`: indicates whether the image is blurry or contains heavy shadows
        - Camera model (as above): contained in `fl_x`, `fl_y`, ..., `k1`, `k2`, `k3`, `k4`, `camera_model`. The intrinsics correspond to the resized images.
        - `has_mask`: global flag for the scene, indicating whether it has anonymization masks
      - `transforms_undistorted.json`: similar to `transforms.json`, but for the undistorted DSLR images
    - `train_test_lists.json`
      - `json["train"]`: training images
      - `json["test"]`: novel views (test images). The split is the same as the one in `nerfstudio/transforms.json`.
      - `json["has_masks"]`: global flag for the scene, indicating whether it has anonymization masks
  - `iphone/`
    - `rgb.mkv`: full RGB video, 60 FPS
    - `rgb_mask.mkv`: video of the anonymization masks, losslessly compressed; after decoding, similar to the DSLR masks
    - `depth.bin`: 16-bit depth images in millimeters from the iPhone LiDAR sensor, packed into a single binary file; the depth images are aligned with the RGB images
    - `rgb/`: subsampled RGB frames obtained by running the processing script on `rgb.mkv`; resolution 1920 x 1440
    - `depth/`: depth images as 16-bit PNGs in millimeters, obtained by running the processing script on `depth.bin`; aligned with the RGB images but at a much lower resolution (256 x 192)
    - `pose_intrinsic_imu.json`: ARKit poses and IMU data from the iPhone
      - `json["poses"]` contains a 4x4 camera-to-world extrinsic matrix from the raw ARKit output; the coordinate system is right-handed and +Z is the camera direction
      - `json["intrinsic"]` contains a 3x3 intrinsic matrix of the RGB image
      - `json["aligned_poses"]` contains ARKit poses that are scaled and transformed into our mesh space
      - No intrinsics are provided for the iPhone LiDAR depth; since RGB and depth are aligned, you can scale the RGB intrinsics to the resolution of the LiDAR depth map (see the sketch at the end of this page)
    - `nerfstudio/`: similar to DSLR
    - `colmap/`: similar to DSLR. The images here have been filtered based on the agreement between the iPhone LiDAR depth and the laser scan. The camera model is OPENCV with 4 distortion parameters: k1, k2, p1, p2.
    - `exif.json`: EXIF information for each frame of the video
  - `panocam/`
    - `images/`: JPG images with an aspect ratio of approximately 2:1 or 2.5:1; image i corresponds to scanner pose i in `scans/scanner_poses.json`
    - `anon_mask/`: PNG, similar to the DSLR anonymization masks
    - `depth/`: depth images in millimeters, 16-bit PNG
    - `azim/`: azimuth-angle images, radians * 1000, 16-bit PNG
    - `elev/`: elevation-angle images, radians * 1000, 16-bit PNG
    - Additionally, `resized_*` folders contain the images, depth, masks, azimuth, and elevation resized to 1/4 of the original size. See the example code for usage.

All data is anonymized using the magenta color with RGB value (255, 0, 255). You may fill these regions with any color using the given binary masks.
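For example, here is a minimal sketch of filling the anonymized regions using the mask, assuming `opencv-python`; the file names are purely illustrative:

```python
import cv2

# Illustrative paths; actual frame names depend on the scene.
img = cv2.imread("data/SCENE_ID/dslr/resized_images/DSC00001.JPG")
mask = cv2.imread("data/SCENE_ID/dslr/resized_anon_masks/DSC00001.png",
                  cv2.IMREAD_GRAYSCALE)

# In the masks, 0 marks anonymized (magenta) pixels and 255 marks valid pixels.
img[mask == 0] = 0  # replace magenta with black; any color works
cv2.imwrite("DSC00001_filled.JPG", img)
```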
To ensure a fair comparison, the scenes in the nvs_test split do not contain 3D information such as meshes or iPhone depth maps.
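Finally, a minimal sketch of working with the iPhone LiDAR depth described above: it converts a depth frame to meters and scales the RGB intrinsics to the depth resolution. It assumes `opencv-python` and `numpy`; the path, frame name, and intrinsic values are placeholders rather than actual dataset values.

```python
import cv2
import numpy as np

# Illustrative path and frame name.
depth = cv2.imread("data/SCENE_ID/iphone/depth/frame_000000.png",
                   cv2.IMREAD_UNCHANGED)      # 16-bit PNG, millimeters, 256 x 192
depth_m = depth.astype(np.float32) / 1000.0   # depth in meters

# K_rgb: 3x3 intrinsics of the 1920 x 1440 RGB frames, e.g. taken from
# pose_intrinsic_imu.json as described above (placeholder values here).
K_rgb = np.array([[1500.0,    0.0, 960.0],
                  [   0.0, 1500.0, 720.0],
                  [   0.0,    0.0,   1.0]])

# RGB and depth are pixel-aligned, so the intrinsics scale with the resolution ratio.
h, w = depth.shape
K_depth = np.diag([w / 1920.0, h / 1440.0, 1.0]) @ K_rgb
```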