Model Zoo

We provide pre-trained models for view synthesis with 3D Gaussian splatting and scale-consistent depth estimation from multi-view posed images.
We assume that the downloaded weights are stored in the pretrained directory. If they are stored elsewhere, we recommend creating a symbolic link from YOUR_MODEL_PATH to pretrained:

ln -s YOUR_MODEL_PATH pretrained

Each model file name on this page includes a prefix of the file's SHA-256 checksum. To verify the integrity of a downloaded file, run sha256sum filename and check that the output starts with that prefix.
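The check can also be scripted. The following is a minimal sketch, assuming that the last hyphen-separated token of each file name (before .pth) is the leading hex digits of the file's SHA-256 digest, as the file names on this page suggest. The helpers `sha256_hex` and `verify_checkpoint` are hypothetical and not part of the released code.

```python
import hashlib
from pathlib import Path


def sha256_hex(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path):
    """Check that the hash prefix in the file name matches the file contents.

    Assumes names like `...-cfeab6b1.pth`, where the last hyphen-separated
    token is a prefix of the SHA-256 hex digest (an assumption inferred
    from the file names listed on this page).
    """
    expected_prefix = Path(path).stem.rsplit("-", 1)[-1].lower()
    return sha256_hex(path).startswith(expected_prefix)


if __name__ == "__main__":
    # Verify every checkpoint found in the `pretrained` directory.
    for ckpt in sorted(Path("pretrained").glob("*.pth")):
        status = "OK" if verify_checkpoint(ckpt) else "MISMATCH"
        print(f"{status}  {ckpt.name}")
```

Run from the directory that contains pretrained, the script prints one status line per checkpoint. A MISMATCH usually means the download was truncated or corrupted.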

Gaussian Splatting

The models are trained on RealEstate10K (re10k) and/or DL3DV (dl3dv) datasets at resolutions of 256x256, 256x448, and 448x768. The number of training views ranges from 2 to 10.
The "→" symbol indicates that a model is trained in two stages. For example, "re10k → (re10k+dl3dv)" means the model is first trained on the RealEstate10K dataset and then fine-tuned on a combination of the RealEstate10K and DL3DV datasets.

Model	Training Data	Training Resolution	Training Views	Params (M)	Download
depthsplat-gs-small-re10k-256x256-view2-cfeab6b1.pth	re10k	256x256	2	37	download
depthsplat-gs-base-re10k-256x256-view2-ca7b6795.pth	re10k	256x256	2	117	download
depthsplat-gs-large-re10k-256x256-view2-e0f0f27a.pth	re10k	256x256	2	360	download
depthsplat-gs-base-re10k-256x448-view2-fea94f65.pth	re10k	256x448	2	117	download
depthsplat-gs-base-dl3dv-256x448-randview2-6-02c7b19d.pth	re10k → dl3dv	256x448	2-6	117	download
depthsplat-gs-small-re10kdl3dv-448x768-randview4-10-c08188db.pth	re10k → (re10k+dl3dv)	256x448 → 448x768	4-10	37	download
depthsplat-gs-base-re10kdl3dv-448x768-randview2-6-f8ddd845.pth	re10k → (re10k+dl3dv)	256x448 → 448x768	2-6	117	download
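The file names in the table encode most of the training configuration. As an illustration, the sketch below parses a file name into its fields. The pattern is inferred from the names listed on this page and is an assumption, as is reading the resolution as height x width. `CKPT_PATTERN` and `parse_checkpoint_name` are hypothetical names, not part of the released code.

```python
import re

# Naming pattern inferred from the file names on this page (an assumption):
# depthsplat-<task>-<size>-[<data>-]<H>x<W>-[rand]view<min>[-<max>]-<hash>.pth
CKPT_PATTERN = re.compile(
    r"^depthsplat-(?P<task>gs|depth)-(?P<size>small|base|large)-"
    r"(?:(?P<data>[a-z0-9]+)-)?"
    r"(?P<height>\d+)x(?P<width>\d+)-"
    r"(?:rand)?view(?P<min_views>\d+)(?:-(?P<max_views>\d+))?-"
    r"(?P<sha256_prefix>[0-9a-f]{8})\.pth$"
)


def parse_checkpoint_name(name):
    """Split a checkpoint file name into a dict of training settings."""
    match = CKPT_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {name}")
    info = match.groupdict()
    for key in ("height", "width", "min_views"):
        info[key] = int(info[key])
    # A fixed view count (e.g. `view2`) has no upper bound in the name,
    # so the maximum defaults to the minimum.
    if info["max_views"] is None:
        info["max_views"] = info["min_views"]
    else:
        info["max_views"] = int(info["max_views"])
    return info


if __name__ == "__main__":
    name = "depthsplat-gs-small-re10kdl3dv-448x768-randview4-10-c08188db.pth"
    print(parse_checkpoint_name(name))
```

Note that the data token in a file name reflects only the final training stage. The full two-stage schedule is given in the Training Data column.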

Depth Estimation

The depth models are trained with the following procedure:
- Initialize the monocular feature with Depth Anything V2 and the multi-view Transformer with UniMatch.
- Train the full DepthSplat model end-to-end on the mixed RealEstate10K and DL3DV datasets.
- Fine-tune the pre-trained depth model with ground-truth depth supervision on the ScanNet, TartanAir, and VKITTI2 depth datasets.

The depth models are fine-tuned with a random number of input images (2 to 8) at a training resolution of 352x640.
The scale of the predicted depth is aligned with the scale of the camera pose translations.
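Because the predicted depth is expressed in the same units as the pose translations, it can be converted to metric units whenever the real-world distance between two input cameras is known. The following is a minimal sketch of that unit conversion. The function names and the externally measured distance are assumptions made for illustration and are not part of the released code.

```python
import math


def meters_per_pose_unit(cam_center_a, cam_center_b, known_distance_m):
    """Return meters per pose unit from two camera centers (in pose units)
    and the measured real-world distance between them (in meters)."""
    baseline = math.dist(cam_center_a, cam_center_b)
    if baseline == 0.0:
        raise ValueError("camera centers coincide; the scale is undefined")
    return known_distance_m / baseline


def depth_to_meters(depth_values, scale):
    """Convert depth values from pose units to meters."""
    return [d * scale for d in depth_values]


if __name__ == "__main__":
    # Two cameras 0.5 pose units apart that are 0.1 m apart in reality.
    scale = meters_per_pose_unit((0.0, 0.0, 0.0), (0.5, 0.0, 0.0), 0.1)
    print(depth_to_meters([2.0, 5.0], scale))
```

In this example each pose unit corresponds to 0.2 m, so a predicted depth of 2.0 pose units corresponds to 0.4 m.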

Model	Training Data	Training Resolution	Training Views	Params (M)	Download
depthsplat-depth-small-352x640-randview2-8-e807bd82.pth	(re10k+dl3dv) → (scannet+tartanair+vkitti2)	448x768 → 352x640	2-8	36	download
depthsplat-depth-base-352x640-randview2-8-65a892c5.pth	(re10k+dl3dv) → (scannet+tartanair+vkitti2)	448x768 → 352x640	2-8	111	download