Zador Pataki · Paul-Edouard Sarlin · Johannes Schönberger · Marc Pollefeys
MP-SfM augments Structure-from-Motion with monocular depth and normal priors for reliable 3D reconstruction despite extreme viewpoint changes and little visual overlap.
MP-SfM is a Structure-from-Motion pipeline that integrates monocular depth and normal predictions into classical multi-view reconstruction. This hybrid approach improves robustness in difficult scenarios such as low parallax, high symmetry, and sparse viewpoints, while maintaining strong performance in standard conditions. This repository includes code, pretrained models, and instructions for reproducing our results.
- 🔧 Setup — Install dependencies and prepare the environment.
- 🚀 Run the Demo — Try the full MP-SfM pipeline on example data.
- 🛠️ Pipeline Configurations — Customize your pipeline with OmegaConf configs.
- 📈 Extending MP-SfM: Use Your Own Priors — Integrate your own depth, normal, or matching modules.
We provide the Python package `mpsfm`. First clone the repository and install the dependencies:
```bash
git clone --recursive https://github.com/cvg/mpsfm && cd mpsfm
```
Build pyceres and pycolmap (from our fork) from source, then install the required packages:
```bash
pip install -r requirements.txt
python -m pip install -e .
```
[Optional - click to expand]
- For faster inference with the transformer-based models, install xformers.
- For faster inference with the MASt3R matcher, compile the CUDA kernels for RoPE as recommended by the authors:
```bash
DIR=$PWD
cd third_party/mast3r/dust3r/croco/models/curope/
python setup.py build_ext --inplace
cd $DIR
```
Our demo notebook demonstrates a minimal usage example. It shows how to run the MP-SfM pipeline, and how to visualize the reconstruction with its multiple output modalities.
Visualizing MP-SfM sparse and dense reconstruction outputs in the demo.
Alternatively, run the reconstruction from the command line:
```bash
# Use default ⚙️
# --conf:            see config dir "configs" for other curated options
# --data_dir:        hosts SfM inputs and outputs when other options aren't specified
# --intrinsics_path: path to the intrinsics file
# --images_dir:      images directory
# --cache_dir:       extraction outputs: depths, normals, matches, etc.
# --extract:         use ["sky", "features", "matches", "depth", "normals"] to force re-extract
python reconstruct.py \
    --conf sp-lg_m3dv2 \
    --data_dir local/example \
    --intrinsics_path local/example/intrinsics.yaml \
    --images_dir local/example/images \
    --cache_dir local/example/cache_dir \
    --extract \
    --verbose 0

# Or simply run this and let argparse take care of the default inputs
python reconstruct.py
```
The script will reconstruct the scene in local/example, and output the reconstruction into local/example/sfm_outputs.
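To inspect the result programmatically, you can load it with pycolmap; a minimal sketch, assuming the sparse model is written in COLMAP format directly under `local/example/sfm_outputs` (it may land in a subdirectory):

```python
# Load and summarize the sparse model written by reconstruct.py.
import pycolmap

rec = pycolmap.Reconstruction("local/example/sfm_outputs")
print(rec.summary())  # number of cameras, images, points, observations
for image_id, image in rec.images.items():
    print(image_id, image.name)  # registered images
```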
- Extraction: Some configurations only cache a subset of prior outputs, for example only the normals of Metric3Dv2. Re-extract using `--extract` when later using a prior pipeline that requires all outputs.
- Verbosity: Change the verbosity level of the pipeline using `--verbose`:
  - `0` provides clean output.
  - `1` offers minimal debugging output, including function benchmarking and a 3D visualization (`3d.html`) saved in your `--data_dir` at the end of the process.
  - `2` saves a visualization after every 5 registered images, pauses the pipeline, and provides additional debugging outputs.
  - `3` provides full debugging outputs.
[Run with your own data - click to expand]
Check out our example data directory.
- Images: Add your images to a single folder: either a folder called `images` in the `--data_dir`, or point to it via `--images_dir`.
- Camera Intrinsics: Create a single `.yaml` file storing all camera intrinsics. Place it in your `--data_dir` and call it `intrinsics.yaml`, or point to it via `--intrinsics_path`. Follow the structure presented in intrinsics.yaml, or see the description below:

[Intrinsics file example - click to expand]
Single Camera:
```yaml
# .yaml setup when images have shared intrinsics
1:
  params: [604.32447211, 604.666982, 696.5, 396.5]  # fx, fy, cx, cy
  images: all  # or specify the images belonging to this camera
  # images:
  #   - indoor_DSC03018.JPG
  #   - indoor_DSC03200.JPG
  #   - indoor_DSC03081.JPG
  #   - indoor_DSC03194.JPG
  #   - indoor_DSC03127.JPG
  #   - indoor_DSC03131.JPG
  #   - indoor_DSC03218.JPG
```
Multiple cameras:
```yaml
# .yaml setup when images have different intrinsics
# camera 1
1:
  params: [fx1, fy1, cx1, cy1]
  images:
    - im11.jpg
    - im12.jpg
    ...
# camera 2
2:
  params: [fx2, fy2, cx2, cy2]
  images:
    - im21.jpg
    - im22.jpg
    ...
```
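If your calibration already lives in code, you can generate this file with a small script; a minimal sketch assuming PyYAML and the single-camera layout above (paths and values are placeholders):

```python
# Write an intrinsics.yaml in the format shown above (placeholder values).
import yaml

cameras = {
    1: {
        "params": [604.32447211, 604.666982, 696.5, 396.5],  # fx, fy, cx, cy
        "images": "all",  # or a list of image filenames for this camera
    }
}
with open("local/example/intrinsics.yaml", "w") as f:
    yaml.safe_dump(cameras, f, sort_keys=False)
```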
We extend COLMAP’s incremental mapping pipeline with monocular priors, for which we provide easily adjustable hyperparameters via configs.
We have fine-grained control over all hyperparameters via OmegaConf configurations, which have sensible default values defined in `MpsfmMapper.default_conf`. Run this Python script to display a human-readable overview of all adjustable parameters. Note: we import all default COLMAP hyperparameters, but only use a subset.
```python
from mpsfm.sfm.mapper import MpsfmMapper
from mpsfm.utils.tools import summarize_cfg

print(summarize_cfg(MpsfmMapper.default_conf))
```
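Overrides can also be composed programmatically with OmegaConf before launching a reconstruction; a minimal sketch, where the override key and its placement are illustrative (check the `summarize_cfg` output above for the exact structure):

```python
# Sketch: merge hypothetical overrides onto the defaults, mirroring what the
# curated .yaml files in configs/ do declaratively.
from omegaconf import OmegaConf

from mpsfm.sfm.mapper import MpsfmMapper

overrides = OmegaConf.create({"matches_mode": "sparse+dense"})
conf = OmegaConf.merge(MpsfmMapper.default_conf, overrides)
print(OmegaConf.to_yaml(conf))
```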
See our configuration directory for all of our carefully selected configuration setups. Each `.yaml` file overrides the default configuration, with the exception of the empty default setup `sp-lg_m3dv2`. Additionally, other configuration setups can be imported using `defaults:` (see example). This is important because the hyperparameters in some configuration setups (see defaults) were carefully grouped.
Here, we provide an example configuration file detailing all of the important configurations.
[Click to expand]
```yaml
# Untested config created to demonstrate how to write config files
# import default configs to make sure depth estimators are used with correct uncertainties
defaults:
  - defaults/depthpro  # in this example we use depthpro
reconstruction:
  image:
    depth:
      depth_uncertainty: 0.2  # we can override the default uncertainty in defaults/depthpro.yaml (not recommended)
    normals:
      flip_consistency: true  # use flip consistency check for normals (see defaults in mpsfm/sfm/scene/image/normals.py)
extractors:
  # use dsine normals instead of metric3dv2 (default set in mpsfm/extraction/base.py)
  # use "-fc" variant because we need flipped estimates for the "flip_consistency" check
  normals: DSINE-kappa-fc
matcher: roma_outdoor  # change matcher
# for dense matchers we can use any combination of sparse and dense by combining with "+"
# for mast3r, you can additionally set "depth", e.g. "sparse+dense+depth"
matches_mode: dense
# change high-level mapper logic:
depth_consistency: false  # removes depth consistency check
integrate: false  # disables depth optimization
int_covs: true  # enables optimized depth map uncertainty propagation
# more advanced mapper options
triangulator:
  # avoids introducing 3D points with large errors (during retriangulation) for images that
  # observe fewer than 120 3D points with track length < 2 (defaults in mpsfm/sfm/mapper/triangulator.py)
  nsafe_threshold: 120
  colmap_options:
    min_angle: 0.1  # increase minimum triangulation angle from default (defaults in mpsfm/sfm/mapper/triangulator.py)
```
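If saved as, e.g., `configs/my_config.yaml` (a hypothetical name), this setup can then be selected with `python reconstruct.py --conf my_config`.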
- sp-lg_m3dv2 ⚡️ (default): Fastest reconstruction with very precise camera poses. Failure cases occur only in scenes with little texture or very challenging viewpoint changes.
- sp-mast3r 💪: Robust reconstruction even under egregious viewpoint changes and very low overlap. Thanks to anchoring matches around SuperPoint keypoints, reconstruction is also precise.
- sp-mast3r-dense 💪: Like the above, but also leverages dense correspondences in non-salient regions. As a result, this configuration can reconstruct scenes in the most challenging setups: very low overlap + low texture + egregious viewpoint changes (e.g. opposing views). This, however, comes at the cost of precision.
- sp-roma-dense_m3dv2 🏋️: In the absence of egregious viewpoint changes, this is our most accurate pipeline, but also the most expensive.
Below, we detail the benefits of the key priors we recommend, in case you want to mix and match configurations.
Check out the available feature extraction and matching configurations.
Our default pipeline is built on top of SuperPoint+LightGlue. However, at the cost of additional compute, dense matchers yield improved accuracy on low-overlap scenes. Our pipeline supports three matching modes (`sparse`, `dense`, `sparse+dense`). See our demo for more details.
[Configuration Recommendations - click to expand]
We recommend using `sparse` or `sparse+dense`:

- SuperPoint+LightGlue: Fast ⚡️ and precise, however it struggles under harsh viewpoint changes.
- MASt3R
  - `sparse`: Robust 💪 against egregious viewpoint changes (like opposing views) and also precise thanks to SuperPoint keypoints, with a moderate extraction speed.
  - `sparse+dense`: Robust 💪 even in featureless environments, however precision and extraction speed drop.
- RoMa
  - `sparse+dense`: Best performance 💥 in low-overlap scenarios without symmetries and difficult viewpoint changes; however, it is resource intensive, cannot match egregious viewpoint changes, and struggles to reject negative pairs (symmetry issues).
  - `sparse`: Good performance, however sampling sparse matches from RoMa doubles the extraction time; better to use `sparse+dense` in challenging scenarios, or a faster matcher.
Our pipeline leverages depth and normal estimators and their corresponding uncertainties, which we carefully calibrated per depth estimator. We found that combining uncertainties estimated by the network (where applicable) with uncertainties modeled proportional to the depth estimates is reliable (see per-estimator setups).
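As a minimal numeric illustration of the proportional model, with a hypothetical calibration factor of 0.2:

```python
# Uncertainty proportional to predicted depth: distant geometry is trusted less.
import numpy as np

depth = np.array([1.0, 4.0, 12.0])  # predicted metric depths in meters
depth_uncertainty = 0.2             # hypothetical per-estimator calibrated factor
sigma = depth_uncertainty * depth   # -> [0.2, 0.8, 2.4] m standard deviations
print(sigma)
```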
[Configuration Recommendations - click to expand]
Depth:
- Metric3Dv2:
  - Giant2 (our default): Great generalizable estimates 💥, at the cost of extraction speed and GPU memory.
  - Large: Maintains performance against Giant2 💪 in many scenarios while significantly improving extraction speed.
  - Small: Provides very fast ⚡️ extraction and performs sufficiently well in easy scenarios.
- DepthPro: Competes with Metric3Dv2-Giant2 in depth quality 💪, however with similarly large extraction times, and it is limited by a lack of predicted uncertainties.
- DepthAnythingV2: Reasonable performance in small-scale environments.
- MASt3R: Estimates depth maps using two input views. As a result, it achieves the best performance 💥 at extracting relative scales between background and foreground objects, which is critical in some low-overlap scenarios.

Normals:
- Metric3Dv2: our default normal estimator.
- DSINE: Fastest ⚡️ extraction times, however with a drop in generalizability.
Our extractors follow the hloc format, so MP-SfM can be extended with improved monocular surface estimators with minimal effort. Better monocular surface priors (surface and uncertainty predictions) will enable more robust reconstructions in the most challenging scenarios. Moreover, the pipeline could greatly benefit from improved matchers capable of rejecting negative pairs.
[Configuration Recommendations - click to expand]
- We extract and match sparse features using hloc modules (see feature configs and matcher configs)
- Follow the structure presented in superpoint to add your own feature extractor
- Follow the structure presented in lightglue to add your own matcher
- Our dense matching framework with accompanying config files can match both salient features and sample matches on featureless regions
- We support two types of dense feature matchers, both of which interpolate predictions around salient features to match them. Follow the corresponding structures:
  - Feature map pair (utils): networks output feature maps per image, and matches are sampled through a nearest-neighbor search (see the sketch after this list)
  - Warp (utils): networks directly predict pixelwise correspondences
- See our monocular prior extraction framework and its accompanying config files (a hypothetical extractor skeleton is sketched after this list)
- For predicting both depth and normals, follow this class structure
- Our pipeline relies on monocular prior uncertainties, which require calibration. Check out the different uncertainty representations [`prior_uncertainty`, `flip_consistency`, `depth_uncertainty`] in the Depth Object and similarly [`prior_uncertainty`, `flip_consistency`] in the Normals Object
- For leveraging `flip_consistency`, the model must extract two sets of priors per image (see config). This, however, doubles the extraction time and storage requirements
- If your matcher also extracts depth maps, follow this class structure
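As referenced in the list above, here is a minimal sketch of the nearest-neighbor search behind the "feature map pair" mode; this is generic PyTorch, not mpsfm's exact implementation:

```python
# Mutual nearest-neighbor matching between descriptors sampled from two dense
# feature maps (generic illustration of the "feature map pair" mode).
import torch


def mutual_nn_matches(f0: torch.Tensor, f1: torch.Tensor) -> torch.Tensor:
    """f0: (N, D), f1: (M, D) L2-normalized descriptors; returns (K, 2) index pairs."""
    sim = f0 @ f1.T                # (N, M) cosine similarities
    nn01 = sim.argmax(dim=1)       # best f1 index for each f0 descriptor
    nn10 = sim.argmax(dim=0)       # best f0 index for each f1 descriptor
    idx0 = torch.arange(f0.shape[0])
    mutual = nn10[nn01] == idx0    # keep only pairs that agree in both directions
    return torch.stack([idx0[mutual], nn01[mutual]], dim=1)
```

And a hypothetical skeleton of a custom monocular depth extractor. The method names (`_init`, `_forward`, `required_inputs`) mirror hloc's BaseModel convention, which our extractors follow; the import path and output keys are assumptions to verify against `mpsfm/extraction`:

```python
# Hypothetical custom depth extractor skeleton; verify the base class and the
# expected output keys in mpsfm/extraction before use.
import torch

from mpsfm.extraction.base import BaseModel  # assumed import path


class MyDepthEstimator(BaseModel):
    default_conf = {
        "model_path": "weights/my_depth.pt",  # hypothetical checkpoint
        "depth_uncertainty": 0.1,             # calibrated proportional factor
    }
    required_inputs = ["image"]

    def _init(self, conf):
        # load the network once at construction time
        self.net = torch.jit.load(conf.model_path).eval()

    def _forward(self, data):
        depth = self.net(data["image"])  # (B, 1, H, W) metric depth
        # no uncertainty head here: model sigma proportional to depth, as the
        # pipeline does for estimators without predicted uncertainties
        uncertainty = self.conf.depth_uncertainty * depth
        return {"depth": depth, "uncertainty": uncertainty}
```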
If you use any ideas from the paper or code from this repo, please consider citing:
@inproceedings{pataki2025mpsfm,
author = {Zador Pataki and
Paul-Edouard Sarlin and
Johannes L. Sch\"onberger and
Marc Pollefeys},
title = {{MP-SfM: Monocular Surface Priors for Robust Structure-from-Motion}},
booktitle = {CVPR},
year = {2025}
}