# Documentation

## ScanNet++ Toolbox

Check out our ScanNet++ Toolbox on GitHub. It provides tools and code to:

- Read the dataset structure
- Decode the iPhone RGB, depth, and mask videos
- Undistort the DSLR fisheye images into pinhole images
- Render high-resolution depth maps from the mesh for the DSLR and iPhone frames
- Nerfstudio dataparser for DSLR images is ready (see this PR)
- Prepare training data for the semantic tasks
- Official evaluation code for the benchmark

## 3D Gaussian Splatting on ScanNet++

Check out our 3DGS example on GitHub. It provides tools and code to visualize the frames and train 3D Gaussian Splatting on the ScanNet++ scenes. It also includes scripts to render images for submission to the NVS benchmark (DSLR).

## Data Structure

The ScanNet++ dataset currently consists of 1006 scenes. The default download (low-resolution DSLR images, iPhone data, 3D meshes, and semantics) occupies about 1.5 TB on disk.

| Asset group | Download size |
| --- | --- |
| DSLR (2 MP), iPhone, meshes and semantics (default download) | 1.5 TB |
| DSLR (2 MP) | 371 GB |
| DSLR 2 MP + 33 MP (hi-res) | 9 TB |
| Meshes and semantics | 132 GB |
| Point clouds | 720 GB |
| Panocam | 319 GB |

The download contains one folder per scene with the laser scan, DSLR, and iPhone data, plus several metadata files. The data is organized as follows:

- `split/`
  - `nvs_sem_train.txt`: training set for the NVS and semantic tasks, 230 scenes
  - `nvs_sem_val.txt`: validation set for the NVS and semantic tasks, 50 scenes
  - `nvs_test.txt`: test set for NVS, 50 scenes
  - `nvs_test_small.txt`: smaller test set for NVS, 12 scenes; a subset of `nvs_test.txt`
  - `nvs_test_iphone.txt`: test set for NVS with iPhone data, 12 scenes
  - `sem_test.txt`: test set for the semantic tasks, 50 scenes
  - Each file lists the scene IDs of the respective split.
- `metadata/`
  - `scene_types.json`: scene ID to scene type mapping for all scenes
  - `semantic_classes.txt`: list of semantic classes
  - `instance_classes.txt`: subset of the semantic classes that have instances (i.e., excludes wall, ceiling, floor, ...)
  - `semantic_benchmark/`
    - `top100.txt`: top 100 semantic classes for the semantic segmentation benchmark
    - `top100_instance.txt`: subset of the top 100 classes used for the instance segmentation benchmark
    - `map_benchmark.csv`: mapping from raw semantic labels to benchmark labels
- `data/<scene_id>/`
  - `scans/`
    - `pc_aligned.ply`: point cloud from the laser scanner, axis-aligned
    - `pc_aligned_mask.txt`: indices of anonymized points
    - `scanner_poses.json`: scanner positions, one 4x4 transformation matrix per position
    - `mesh_aligned_0.05.ply`: mesh decimated to 5% of its original size, obtained from the point cloud
    - `mesh_aligned_0.05_mask.txt`: indices of mesh vertices with anonymization applied
    - `mesh_aligned_0.05_semantic.ply`: the vertex `label` property contains the integer semantic label, indexing into the classes in `semantic_classes.txt`; unlabeled vertices have the label -100
    - `segments.json`: `json_data["segIndices"]` contains the segment ID of each vertex
    - `segments_anno.json`: `json_data[i]` corresponds to a single annotated object
      - `label`: the semantic label of this object
      - `segments`: all the segments belonging to this object
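As a quick illustration of the per-vertex labels described above, here is a minimal sketch for reading them. It assumes the `numpy` and `plyfile` packages are installed; `SCENE_ID` is a placeholder, and the paths simply follow the layout on this page.

```python
import numpy as np
from plyfile import PlyData

scene_dir = "data/SCENE_ID"  # SCENE_ID is a placeholder, not a real scene ID
classes = open("metadata/semantic_classes.txt").read().splitlines()

ply = PlyData.read(f"{scene_dir}/scans/mesh_aligned_0.05_semantic.ply")
labels = np.asarray(ply["vertex"]["label"])  # one integer label per vertex
valid = labels >= 0                          # unlabeled vertices have label -100

print(f"{valid.sum()} / {len(labels)} vertices are labeled")
print("class of first labeled vertex:", classes[labels[valid][0]])
```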
  - `dslr/`
    - `resized_images/`: fisheye DSLR images, resized, JPG
    - `resized_anon_masks/`: PNG masks specifying which pixels have been anonymized (0: invalid, 255: valid)
    - `original_images/`: full-resolution images, JPG
    - `original_anon_masks/`: PNG, similar to the resized masks
    - `resized_undistorted_images/`: undistorted DSLR images with the same resolution as the resized images, JPG
    - `resized_undistorted_masks/`: PNG, similar to the resized masks
    - `colmap/`: the COLMAP camera model, aligned with the 3D scans, which means the poses are in metric scale. Make sure to use this model if you want to do 2D-3D matching against the provided mesh.
      - `cameras.txt`: camera type (OPENCV_FISHEYE) and intrinsic parameters (fx, fy, cx, cy, distortion parameters)
      - `images.txt`: extrinsics of each image: qvec (quaternion) and tvec
      - `points3D.txt`: 3D feature points used by COLMAP
      - Useful references: the COLMAP docs for the camera model, more details in the COLMAP source code, the Python reader provided by COLMAP, and the Python (Open3D) visualizer provided by COLMAP
    - `nerfstudio/`
      - `transforms.json`: the same camera poses in the format used by Nerfstudio, i.e., the OpenGL/Blender convention. Note that this coordinate system differs from the OpenCV/COLMAP convention.
        - Poses: `frames` and `test_frames` contain the poses of the train and test images, respectively. Each frame has
          - `mask`: filename of the binary mask file
          - `is_bad`: indicates whether the image is blurry or contains heavy shadows
        - Camera model (as above): contained in `fl_x`, `fl_y`, ..., `k1`, `k2`, `k3`, `k4`, `camera_model`. The intrinsics correspond to the resized images.
        - `has_mask`: global flag for the scene, indicating whether it has anonymization masks
      - `transforms_undistorted.json`: similar to `transforms.json`, but for the undistorted DSLR images
    - `train_test_lists.json`
      - `json["train"]`: training images
      - `json["test"]`: novel views (test images). The split is the same as the one in `nerfstudio/transforms.json`.
      - `json["has_masks"]`: global flag for the scene, indicating whether it has anonymization masks
  - `iphone/`
    - `rgb.mkv`: full RGB video, 60 FPS
    - `rgb_mask.mkv`: video of the anonymization masks, losslessly compressed; after decoding, similar to the DSLR masks
    - `depth.bin`: 16-bit depth images in millimeters from the iPhone LiDAR sensor, packed into a single binary file; the depth images are aligned with the RGB images
    - `rgb/`: subsampled RGB frames obtained by running the processing script on `rgb.mkv`; resolution 1920 x 1440
    - `depth/`: depth images as 16-bit PNGs in millimeters, obtained by running the processing script on `depth.bin`; aligned with the RGB images but at a much lower resolution (256 x 192)
    - `pose_intrinsic_imu.json`: ARKit poses and IMU data from the iPhone
      - `json["poses"]` contains a 4x4 camera-to-world extrinsic matrix from the raw ARKit output; the coordinate system is right-handed and +Z is the camera direction
      - `json["intrinsic"]` contains a 3x3 intrinsic matrix of the RGB image
      - `json["aligned_poses"]` contains ARKit poses that are scaled and transformed into our mesh space
      - No intrinsics are provided for the iPhone LiDAR depth; since RGB and depth are aligned, you can scale the RGB intrinsics to the resolution of the LiDAR depth map (see the sketch at the end of this page)
    - `nerfstudio/`: similar to DSLR
    - `colmap/`: similar to DSLR. The images here have been filtered based on the agreement between the iPhone LiDAR depth and the laser scan. The camera model is OPENCV with 4 distortion parameters: k1, k2, p1, p2.
    - `exif.json`: EXIF information for each frame of the video
  - `panocam/`
    - `images/`: JPG images with an aspect ratio of approximately 2:1 or 2.5:1; image i corresponds to scanner pose i in `scans/scanner_poses.json`
    - `anon_mask/`: PNG, similar to the DSLR anonymization masks
    - `depth/`: depth images in millimeters, 16-bit PNG
    - `azim/`: azimuth-angle images, radians * 1000, 16-bit PNG
    - `elev/`: elevation-angle images, radians * 1000, 16-bit PNG
    - Additionally, `resized_*` folders contain the images, depth, masks, azimuth, and elevation resized to 1/4 of the original size. See the example code for usage.

All data is anonymized using the magenta color with RGB value (255, 0, 255). You may fill these regions with any color using the given binary masks.
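For example, here is a minimal sketch of filling the anonymized regions using the mask, assuming `opencv-python`; the file names are purely illustrative:

```python
import cv2

# Illustrative paths; actual frame names depend on the scene.
img = cv2.imread("data/SCENE_ID/dslr/resized_images/DSC00001.JPG")
mask = cv2.imread("data/SCENE_ID/dslr/resized_anon_masks/DSC00001.png",
                  cv2.IMREAD_GRAYSCALE)

# In the masks, 0 marks anonymized (magenta) pixels and 255 marks valid pixels.
img[mask == 0] = 0  # replace magenta with black; any color works
cv2.imwrite("DSC00001_filled.JPG", img)
```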
To ensure a fair comparison, the scenes in the nvs_test split do not contain 3D information such as meshes or iPhone depth maps.
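Finally, a minimal sketch of working with the iPhone LiDAR depth described above: it converts a depth frame to meters and scales the RGB intrinsics to the depth resolution. It assumes `opencv-python` and `numpy`; the path, frame name, and intrinsic values are placeholders rather than actual dataset values.

```python
import cv2
import numpy as np

# Illustrative path and frame name.
depth = cv2.imread("data/SCENE_ID/iphone/depth/frame_000000.png",
                   cv2.IMREAD_UNCHANGED)      # 16-bit PNG, millimeters, 256 x 192
depth_m = depth.astype(np.float32) / 1000.0   # depth in meters

# K_rgb: 3x3 intrinsics of the 1920 x 1440 RGB frames, e.g. taken from
# pose_intrinsic_imu.json as described above (placeholder values here).
K_rgb = np.array([[1500.0,    0.0, 960.0],
                  [   0.0, 1500.0, 720.0],
                  [   0.0,    0.0,   1.0]])

# RGB and depth are pixel-aligned, so the intrinsics scale with the resolution ratio.
h, w = depth.shape
K_depth = np.diag([w / 1920.0, h / 1440.0, 1.0]) @ K_rgb
```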