\
+ --num_inputs \
+ --video_save_fps 10
+```
+
+- `--num_inputs` is only necessary if there are multiple `train_test_split_*.json` files in the scene folder.
+- The above command works for datasets without a trajectory prior (e.g., DL3DV-140). When a benchmark dataset comes with a trajectory prior, for example the `orbit` prior for the CO3D dataset, we use the `nearest-gt` chunking strategy by setting `--use_traj_prior True --traj_prior orbit --chunk_strategy nearest-gt`. We find this gives more 3D-consistent results.
+- For all single-view conditioning test scenarios, we sweep `--camera_scale` over 20 different values: `0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0`.
+- In the single-view regime for the RealEstate10K dataset, we find that increasing `cfg` helps: we additionally set `--cfg 6.0` (`cfg` is `2.0` by default).
+- For evaluation in the semi-dense-view regime (i.e., the DL3DV-140 and Tanks and Temples datasets) with `32` input views, we zero-shot extend `T` to fit all input and target views in one forward pass. Specifically, we set `--T 90` for the DL3DV-140 dataset and `--T 80` for the Tanks and Temples dataset.
+- For evaluation on the ViewCrafter split (including the RealEstate10K, CO3D, and Tanks and Temples datasets), we find that zero-shot extending `T` to `25` to fit all input and target views in one forward pass works better. The ViewCrafter split also uses the original image resolutions, so we set `--T 25 --L_short 576`.
+
+For example, you can run the following command on the example `dl3d140-165f5af8bfe32f70595a1c9393a6e442acf7af019998275144f605b89a306557` with 3 input views:
+
+```bash
+python demo.py \
+ --data_path /path/to/assets_demo_cli/ \
+ --data_items dl3d140-165f5af8bfe32f70595a1c9393a6e442acf7af019998275144f605b89a306557 \
+ --num_inputs 3 \
+ --video_save_fps 10
+```
+
+## `img2vid`
+
+```bash
+python demo.py \
+ --data_path \
+ --task img2vid \
+ --replace_or_include_input True \
+ --num_inputs \
+ --use_traj_prior True \
+ --chunk_strategy interp
+```
+
+- `--replace_or_include_input True` is necessary here: in this task the input views and target views are mutually exclusive and together form one trajectory, so the input views must be appended back to the generated target views.
+- `--num_inputs` is only necessary if there are multiple `train_test_split_*.json` files in the scene folder.
+- We use `interp` chunking strategy by default.
+- For evaluation on the ViewCrafter split (including the RealEstate10K, CO3D, and Tanks and Temples datasets), we find that zero-shot extending `T` to `25` to fit all input and target views in one forward pass works better. The ViewCrafter split also uses the original image resolutions, so we set `--T 25 --L_short 576`.
+
+## `img2trajvid_s-prob`
+
+```bash
+python demo.py \
+ --data_path \
+ --task img2trajvid_s-prob \
+ --replace_or_include_input True \
+ --traj_prior orbit \
+ --cfg 4.0,2.0 \
+ --guider 1,2 \
+ --num_targets 111 \
+ --L_short 576 \
+ --use_traj_prior True \
+ --chunk_strategy interp
+```
+
+- `--replace_or_include_input True` is necessary here: in this task the input views and target views are mutually exclusive and together form one trajectory, so the input views must be appended back to the generated target views.
+- Default `cfg` should be adjusted according to `traj_prior`.
+- Default chunking strategy is `interp`.
+- Default guider is `--guider 1,2` (`1` still works, but `1,2` is slightly better).
+- `camera_scale` (default is `2.0`) can be adjusted according to `traj_prior`. The model has scale ambiguity with single-view input, especially for panning motions. We recommend increasing `camera_scale` to `10.0` for all panning motions (`--traj_prior pan-*/dolly*`) if you expect a larger camera motion.
+
+## `img2trajvid`
+
+### Sparse-view regime ($P\leq 8$)
+
+```bash
+python demo.py \
+ --data_path \
+ --task img2trajvid \
+ --num_inputs \
+ --cfg 3.0,2.0 \
+ --use_traj_prior True \
+ --chunk_strategy interp-gt
+```
+
+- `--num_inputs` is only necessary if there are multiple `train_test_split_*.json` files in the scene folder.
+- Default `cfg` should be set to `3,2` (`3` being `cfg` for the first pass, and `2` being the `cfg` for the second pass). Try to increase the `cfg` for the first pass from `3` to higher values if you observe blurry areas (usually happens for harder scenes with a fair amount of unseen regions).
+- Default chunking strategy should be set to `interp-gt` (`interp` also works but is usually a bit worse).
+- `--chunk_strategy_first_pass` is set to `gt-nearest` by default, so it adapts automatically when $P$ is large (up to a thousand frames).
+
+### Semi-dense-view regime ($P>9$)
+
+```bash
+python demo.py \
+ --data_path \
+ --task img2trajvid \
+ --num_inputs \
+ --cfg 3.0 \
+ --L_short 576 \
+ --use_traj_prior True \
+ --chunk_strategy interp
+```
+
+- `--num_inputs` is only necessary if there are multiple `train_test_split_*.json` files in the scene folder.
+- Default `cfg` should be set to `3`.
+- Default chunking strategy should be set to `interp` (`interp-gt` is also supported, but the results do not look good).
+- `T` can be overwritten by `--T X,21` (`X` being the extended `T` for the first pass, and `21` being the default `T` for the second pass). `X` is decided dynamically in the code but can also be set manually. This is useful when two adjacent anchors are so dissimilar that the interpolation in the second pass becomes impossible. There are two ways to do this:
+ - `--T 96,21`: this overwrites the `T` in the first pass to be exactly `96`.
+ - `--num_prior_frames_ratio 1.2`: this enlarges T in the first pass dynamically to be `1.2`$\times$ larger.
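The single-view `--camera_scale` sweep described in the `img2img` notes above can be scripted instead of typed out by hand. The following is only a minimal sketch: it assumes `demo.py` accepts the flags exactly as written in this document, and the helper names (`camera_scales`, `build_sweep_commands`), the task name passed in the usage example, and the placeholder data path are hypothetical. It only builds the command lines and does not launch anything.

```python
from typing import List, Sequence


def camera_scales(start: float = 0.1, stop: float = 2.0, step: float = 0.1) -> List[float]:
    """Evenly spaced scales from start to stop (inclusive), rounded to one decimal."""
    count = round((stop - start) / step) + 1
    return [round(start + i * step, 1) for i in range(count)]


def build_sweep_commands(
    data_path: str,
    task: str,
    extra_args: Sequence[str] = (),
) -> List[List[str]]:
    """Build one demo.py command per camera scale (hypothetical helper)."""
    commands = []
    for scale in camera_scales():
        commands.append(
            [
                "python", "demo.py",
                "--data_path", data_path,
                "--task", task,
                *extra_args,
                "--camera_scale", str(scale),
            ]
        )
    return commands


if __name__ == "__main__":
    # Assumed task name and placeholder path; `--cfg 6.0` is the
    # RealEstate10K single-view setting mentioned in the notes.
    for cmd in build_sweep_commands("/path/to/scenes/", "img2img", ["--cfg", "6.0"]):
        print(" ".join(cmd))
```

Each printed line can be run as-is or passed to `subprocess.run`; keeping the commands as argument lists avoids shell-quoting problems with paths.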
diff --git a/docs/GR_USAGE.md b/docs/GR_USAGE.md
new file mode 100644
index 0000000000000000000000000000000000000000..9341d552da4ce79fb565fb5a425c7243924bf9d9
--- /dev/null
+++ b/docs/GR_USAGE.md
@@ -0,0 +1,76 @@
+# :rocket: Gradio Demo
+
+This Gradio demo is the simplest starting point for playing with our project.
+
+You can either visit it at our Hugging Face space [here](https://huggingface.co/spaces/stabilityai/stable-virtual-camera) or run it locally:
+
+```bash
+python demo_gr.py
+```
+
+We provide two ways to use our demo:
+
+1. `Basic` mode, where the user can upload a single image and set a target camera trajectory from our preset options. This is the most straightforward way to use our model and is suitable for most users.
+2. `Advanced` mode, where the user can upload one or multiple images and set a target camera trajectory by interacting with a 3D viewport (powered by [viser](https://viser.studio/latest)). This is suitable for power users and academic researchers.
+
+### `Basic`
+
+This is the default mode when entering our demo (given its simplicity).
+
+You can upload a single image and set a target camera trajectory from our preset options. This is the most straightforward way to use our model and is suitable for most users.
+
+Here is a video walkthrough:
+
+https://github.com/user-attachments/assets/4d965fa6-d8eb-452c-b773-6e09c88ca705
+
+You can choose from 13 preset trajectories that are common for NVS (`move-forward/backward` are omitted for visualization purposes):
+
+https://github.com/user-attachments/assets/b2cf8700-3d85-44b9-8d52-248e82f1fb55
+
+More formally:
+
+- `orbit/spiral/lemniscate` are good for showing the "3D-ness" of the scene.
+- `zoom-in/out` keep the camera position the same while increasing/decreasing the focal length.
+- `dolly zoom-in/out` move camera position backward/forward while increasing/decreasing the focal length.
+- `move-forward/backward/up/down/left/right` move camera position in different directions.
+
+Notes:
+
+- For an 80-frame video at `768x576` resolution, the first-pass generation takes around 20 seconds and the second-pass generation around 2 minutes, tested on a single H100 GPU.
+- Expect roughly 2-3x longer times on the HF space.
+
+### `Advanced`
+
+This is the power mode where you can have very fine-grained control over camera trajectories.
+
+You can upload one or multiple images and set a target camera trajectory by interacting with a 3D viewport. This is suitable for power users and academic researchers.
+
+Here is a video walkthrough:
+
+https://github.com/user-attachments/assets/dcec1be0-bd10-441e-879c-d1c2b63091ba
+
+Notes:
+
+- For a 134-frame video at `576x576` resolution, the first-pass generation takes around 16 seconds and the second-pass generation around 4 minutes, tested on a single H100 GPU.
+- Expect roughly 2-3x longer times on the HF space.
+
+### Pro tips
+
+- If the first-pass sampling result is bad, click the "Abort rendering" button in the GUI to avoid getting stuck in second-pass sampling, so that you can try something else.
+
+### Performance benchmark
+
+We have tested our gradio demo in both a local environment and the HF space environment, across different modes and compilation settings. Here are our results:
+| Total time (s) | `Basic` first pass | `Basic` second pass | `Advanced` first pass | `Advanced` second pass |
+|:------------------------:|:-----------------:|:------------------:|:--------------------:|:---------------------:|
+| HF (L40S, w/o comp.) | 68 | 484 | 48 | 780 |
+| HF (L40S, w/ comp.) | 51 | 362 | 36 | 587 |
+| Local (H100, w/o comp.) | 35 | 204 | 20 | 313 |
+| Local (H100, w/ comp.) | 21 | 144 | 16 | 234 |
+
+Notes:
+
+- The HF space uses an L40S GPU, and our local environment uses an H100 GPU.
+- Compilation is opt-in and uses `torch.compile`.
+- `Basic` mode is tested by generating 80 frames at `768x576` resolution.
+- `Advanced` mode is tested by generating 134 frames at `576x576` resolution.
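The benchmark table above is easier to compare as speedup ratios. A minimal sketch follows: the timings are copied from the table, while the `compile_speedups` helper and the dictionary layout are illustrative choices, not part of the project.

```python
from typing import Dict, List

COLUMNS: List[str] = [
    "Basic first pass",
    "Basic second pass",
    "Advanced first pass",
    "Advanced second pass",
]

# Total time in seconds, copied from the benchmark table.
TIMES: Dict[str, Dict[str, List[int]]] = {
    "HF (L40S)": {
        "w/o comp.": [68, 484, 48, 780],
        "w/ comp.": [51, 362, 36, 587],
    },
    "Local (H100)": {
        "w/o comp.": [35, 204, 20, 313],
        "w/ comp.": [21, 144, 16, 234],
    },
}


def compile_speedups(times: Dict[str, Dict[str, List[int]]]) -> Dict[str, Dict[str, float]]:
    """Ratio of uncompiled to compiled time, per environment and column."""
    speedups: Dict[str, Dict[str, float]] = {}
    for env, runs in times.items():
        speedups[env] = {
            col: base / compiled
            for col, base, compiled in zip(COLUMNS, runs["w/o comp."], runs["w/ comp."])
        }
    return speedups


if __name__ == "__main__":
    for env, cols in compile_speedups(TIMES).items():
        for col, ratio in cols.items():
            print(f"{env:14s} {col:22s} {ratio:.2f}x")
```

On these numbers, compilation gives roughly 1.25x to 1.67x across the measured settings.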
diff --git a/docs/INSTALL.md b/docs/INSTALL.md
new file mode 100644
index 0000000000000000000000000000000000000000..47f971fe9242232b2e614f7a0dd3cc69eacf07fe
--- /dev/null
+++ b/docs/INSTALL.md
@@ -0,0 +1,39 @@
+# :wrench: Installation
+
+### Model Dependencies
+
+```bash
+# Install seva model dependencies.
+pip install -e .
+```
+
+### Demo Dependencies
+
+To use the cli demo (`demo.py`) or the gradio demo (`demo_gr.py`), do the following:
+
+```bash
+# Initialize and update submodules for demo.
+git submodule update --init --recursive
+
+# Install pycolmap dependencies for cli and gradio demo (our model is not dependent on it).
+echo "Installing pycolmap (for both cli and gradio demo)..."
+pip install git+https://github.com/jensenz-sai/pycolmap@543266bc316df2fe407b3a33d454b310b1641042
+
+# Install dust3r dependencies for gradio demo (our model is not dependent on it).
+echo "Installing dust3r dependencies (only for gradio demo)..."
+pushd third_party/dust3r
+pip install -r requirements.txt
+popd
+```
+
+### Dev and Speeding Up (Optional)
+
+```bash
+# [OPTIONAL] Install seva dependencies for development.
+pip install -e ".[dev]"
+pre-commit install
+
+# [OPTIONAL] Install the torch nightly version for faster JIT via torch.compile (speeds up sampling by about 2x in our testing).
+# Please adjust to your own cuda version. For example, if you have cuda 11.8, use the following command.
+pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu118
+```
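After the optional nightly install, it can help to confirm which build is active. The sketch below mirrors the check in `seva/eval.py`, which treats a `dev` tag in `torch.__version__` as the sign of a nightly build. The function name `is_nightly_version` and the sample version strings are illustrative assumptions.

```python
def is_nightly_version(version: str) -> bool:
    """Return True for version strings that carry a 'dev' tag."""
    return "dev" in version


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed")
    else:
        kind = "nightly" if is_nightly_version(torch.__version__) else "stable"
        print(f"torch {torch.__version__} looks like a {kind} build")
```

This is a string heuristic only. It says nothing about whether the installed CUDA build matches your driver.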
diff --git a/pyproject.toml b/pyproject.toml
new file mode 100644
index 0000000000000000000000000000000000000000..cfe20303e39d27dd926d0fc2494aa06622bf3d8c
--- /dev/null
+++ b/pyproject.toml
@@ -0,0 +1,39 @@
+[build-system]
+requires = ["setuptools>=65.5.3"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "seva"
+version = "0.0.0"
+requires-python = ">=3.10"
+dependencies = [
+ "torch>=2.6.0",
+ "roma",
+ "viser",
+ "tyro",
+ "fire",
+ "ninja",
+ "gradio==5.17.0",
+ "einops",
+ "colorama",
+ "splines",
+ "kornia",
+ "open-clip-torch",
+ "diffusers",
+ "numpy==1.24.4",
+ "imageio[ffmpeg]",
+ "huggingface-hub",
+ "opencv-python",
+]
+
+[project.optional-dependencies]
+dev = ["ruff", "ipdb", "pytest", "line_profiler", "pre-commit"]
+
+[tool.setuptools.packages.find]
+include = ["seva"]
+
+[tool.pyright]
+extraPaths = ["third_party/dust3r"]
+
+[tool.ruff]
+lint.ignore = ["E741"]
diff --git a/seva/__init__.py b/seva/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/seva/data_io.py b/seva/data_io.py
new file mode 100644
index 0000000000000000000000000000000000000000..51035663516d6c7aa2521ed9085d7e39fbade3a2
--- /dev/null
+++ b/seva/data_io.py
@@ -0,0 +1,553 @@
+import json
+import os
+import os.path as osp
+from glob import glob
+from typing import Any, Dict, List, Optional, Tuple
+
+import cv2
+import imageio.v3 as iio
+import numpy as np
+import torch
+
+from seva.geometry import (
+ align_principle_axes,
+ similarity_from_cameras,
+ transform_cameras,
+ transform_points,
+)
+
+
+def _get_rel_paths(path_dir: str) -> List[str]:
+    """Recursively get relative paths of files in a directory."""
+    paths = []
+    for dp, _, fn in os.walk(path_dir):
+        for f in fn:
+            paths.append(os.path.relpath(os.path.join(dp, f), path_dir))
+    return paths
+
+
+class BaseParser(object):
+ def __init__(
+ self,
+ data_dir: str,
+ factor: int = 1,
+ normalize: bool = False,
+ test_every: Optional[int] = 8,
+ ):
+ self.data_dir = data_dir
+ self.factor = factor
+ self.normalize = normalize
+ self.test_every = test_every
+
+ self.image_names: List[str] = [] # (num_images,)
+ self.image_paths: List[str] = [] # (num_images,)
+ self.camtoworlds: np.ndarray = np.zeros((0, 4, 4)) # (num_images, 4, 4)
+ self.camera_ids: List[int] = [] # (num_images,)
+ self.Ks_dict: Dict[int, np.ndarray] = {} # Dict of camera_id -> K
+ self.params_dict: Dict[int, np.ndarray] = {} # Dict of camera_id -> params
+ self.imsize_dict: Dict[
+ int, Tuple[int, int]
+ ] = {} # Dict of camera_id -> (width, height)
+ self.points: np.ndarray = np.zeros((0, 3)) # (num_points, 3)
+ self.points_err: np.ndarray = np.zeros((0,)) # (num_points,)
+ self.points_rgb: np.ndarray = np.zeros((0, 3)) # (num_points, 3)
+ self.point_indices: Dict[str, np.ndarray] = {} # Dict of image_name -> (M,)
+ self.transform: np.ndarray = np.zeros((4, 4)) # (4, 4)
+
+ self.mapx_dict: Dict[int, np.ndarray] = {} # Dict of camera_id -> (H, W)
+ self.mapy_dict: Dict[int, np.ndarray] = {} # Dict of camera_id -> (H, W)
+ self.roi_undist_dict: Dict[int, Tuple[int, int, int, int]] = (
+ dict()
+ ) # Dict of camera_id -> (x, y, w, h)
+ self.scene_scale: float = 1.0
+
+
+class DirectParser(BaseParser):
+ def __init__(
+ self,
+ imgs: List[np.ndarray],
+ c2ws: np.ndarray,
+ Ks: np.ndarray,
+ points: Optional[np.ndarray] = None,
+ points_rgb: Optional[np.ndarray] = None, # uint8
+ mono_disps: Optional[List[np.ndarray]] = None,
+ normalize: bool = False,
+ test_every: Optional[int] = None,
+ ):
+ super().__init__("", 1, normalize, test_every)
+
+ self.image_names = [f"{i:06d}" for i in range(len(imgs))]
+ self.image_paths = ["null" for _ in range(len(imgs))]
+ self.camtoworlds = c2ws
+ self.camera_ids = [i for i in range(len(imgs))]
+ self.Ks_dict = {i: K for i, K in enumerate(Ks)}
+ self.imsize_dict = {
+ i: (img.shape[1], img.shape[0]) for i, img in enumerate(imgs)
+ }
+ if points is not None:
+ self.points = points
+ assert points_rgb is not None
+ self.points_rgb = points_rgb
+ self.points_err = np.zeros((len(points),))
+
+ self.imgs = imgs
+ self.mono_disps = mono_disps
+
+ # Normalize the world space.
+ if normalize:
+ T1 = similarity_from_cameras(self.camtoworlds)
+ self.camtoworlds = transform_cameras(T1, self.camtoworlds)
+
+ if points is not None:
+ self.points = transform_points(T1, self.points)
+ T2 = align_principle_axes(self.points)
+ self.camtoworlds = transform_cameras(T2, self.camtoworlds)
+ self.points = transform_points(T2, self.points)
+ else:
+ T2 = np.eye(4)
+
+ self.transform = T2 @ T1
+ else:
+ self.transform = np.eye(4)
+
+ # size of the scene measured by cameras
+ camera_locations = self.camtoworlds[:, :3, 3]
+ scene_center = np.mean(camera_locations, axis=0)
+ dists = np.linalg.norm(camera_locations - scene_center, axis=1)
+ self.scene_scale = np.max(dists)
+
+
+class COLMAPParser(BaseParser):
+ """COLMAP parser."""
+
+ def __init__(
+ self,
+ data_dir: str,
+ factor: int = 1,
+ normalize: bool = False,
+ test_every: Optional[int] = 8,
+ image_folder: str = "images",
+ colmap_folder: str = "sparse/0",
+ ):
+ super().__init__(data_dir, factor, normalize, test_every)
+
+ colmap_dir = os.path.join(data_dir, colmap_folder)
+ assert os.path.exists(
+ colmap_dir
+ ), f"COLMAP directory {colmap_dir} does not exist."
+
+ try:
+ from pycolmap import SceneManager
+ except ImportError:
+ raise ImportError(
+ "Please install pycolmap to use the data parsers: "
+ " `pip install git+https://github.com/jensenz-sai/pycolmap.git@543266bc316df2fe407b3a33d454b310b1641042`"
+ )
+
+ manager = SceneManager(colmap_dir)
+ manager.load_cameras()
+ manager.load_images()
+ manager.load_points3D()
+
+ # Extract extrinsic matrices in world-to-camera format.
+ imdata = manager.images
+ w2c_mats = []
+ camera_ids = []
+ Ks_dict = dict()
+ params_dict = dict()
+ imsize_dict = dict() # width, height
+ bottom = np.array([0, 0, 0, 1]).reshape(1, 4)
+ for k in imdata:
+ im = imdata[k]
+ rot = im.R()
+ trans = im.tvec.reshape(3, 1)
+ w2c = np.concatenate([np.concatenate([rot, trans], 1), bottom], axis=0)
+ w2c_mats.append(w2c)
+
+ # support different camera intrinsics
+ camera_id = im.camera_id
+ camera_ids.append(camera_id)
+
+ # camera intrinsics
+ cam = manager.cameras[camera_id]
+ fx, fy, cx, cy = cam.fx, cam.fy, cam.cx, cam.cy
+ K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])
+ K[:2, :] /= factor
+ Ks_dict[camera_id] = K
+
+ # Get distortion parameters.
+ type_ = cam.camera_type
+ if type_ == 0 or type_ == "SIMPLE_PINHOLE":
+ params = np.empty(0, dtype=np.float32)
+ camtype = "perspective"
+ elif type_ == 1 or type_ == "PINHOLE":
+ params = np.empty(0, dtype=np.float32)
+ camtype = "perspective"
+ elif type_ == 2 or type_ == "SIMPLE_RADIAL":
+ params = np.array([cam.k1, 0.0, 0.0, 0.0], dtype=np.float32)
+ camtype = "perspective"
+ elif type_ == 3 or type_ == "RADIAL":
+ params = np.array([cam.k1, cam.k2, 0.0, 0.0], dtype=np.float32)
+ camtype = "perspective"
+ elif type_ == 4 or type_ == "OPENCV":
+ params = np.array([cam.k1, cam.k2, cam.p1, cam.p2], dtype=np.float32)
+ camtype = "perspective"
+ elif type_ == 5 or type_ == "OPENCV_FISHEYE":
+ params = np.array([cam.k1, cam.k2, cam.k3, cam.k4], dtype=np.float32)
+ camtype = "fisheye"
+ assert (
+ camtype == "perspective" # type: ignore
+ ), f"Only support perspective camera model, got {type_}"
+
+ params_dict[camera_id] = params # type: ignore
+
+ # image size
+ imsize_dict[camera_id] = (cam.width // factor, cam.height // factor)
+
+ print(
+ f"[Parser] {len(imdata)} images, taken by {len(set(camera_ids))} cameras."
+ )
+
+ if len(imdata) == 0:
+ raise ValueError("No images found in COLMAP.")
+ if not (type_ == 0 or type_ == 1): # type: ignore
+ print("Warning: COLMAP Camera is not PINHOLE. Images have distortion.")
+
+ w2c_mats = np.stack(w2c_mats, axis=0)
+
+ # Convert extrinsics to camera-to-world.
+ camtoworlds = np.linalg.inv(w2c_mats)
+
+ # Image names from COLMAP. No need for permuting the poses according to
+ # image names anymore.
+ image_names = [imdata[k].name for k in imdata]
+
+ # Previous Nerf results were generated with images sorted by filename,
+ # ensure metrics are reported on the same test set.
+ inds = np.argsort(image_names)
+ image_names = [image_names[i] for i in inds]
+ camtoworlds = camtoworlds[inds]
+ camera_ids = [camera_ids[i] for i in inds]
+
+ # Load images.
+ if factor > 1:
+ image_dir_suffix = f"_{factor}"
+ else:
+ image_dir_suffix = ""
+ colmap_image_dir = os.path.join(data_dir, image_folder)
+ image_dir = os.path.join(data_dir, image_folder + image_dir_suffix)
+ for d in [image_dir, colmap_image_dir]:
+ if not os.path.exists(d):
+ raise ValueError(f"Image folder {d} does not exist.")
+
+ # Downsampled images may have different names vs images used for COLMAP,
+ # so we need to map between the two sorted lists of files.
+ colmap_files = sorted(_get_rel_paths(colmap_image_dir))
+ image_files = sorted(_get_rel_paths(image_dir))
+ colmap_to_image = dict(zip(colmap_files, image_files))
+ image_paths = [os.path.join(image_dir, colmap_to_image[f]) for f in image_names]
+
+ # 3D points and {image_name -> [point_idx]}
+ points = manager.points3D.astype(np.float32) # type: ignore
+ points_err = manager.point3D_errors.astype(np.float32) # type: ignore
+ points_rgb = manager.point3D_colors.astype(np.uint8) # type: ignore
+ point_indices = dict()
+
+ image_id_to_name = {v: k for k, v in manager.name_to_image_id.items()}
+ for point_id, data in manager.point3D_id_to_images.items():
+ for image_id, _ in data:
+ image_name = image_id_to_name[image_id]
+ point_idx = manager.point3D_id_to_point3D_idx[point_id]
+ point_indices.setdefault(image_name, []).append(point_idx)
+ point_indices = {
+ k: np.array(v).astype(np.int32) for k, v in point_indices.items()
+ }
+
+ # Normalize the world space.
+ if normalize:
+ T1 = similarity_from_cameras(camtoworlds)
+ camtoworlds = transform_cameras(T1, camtoworlds)
+ points = transform_points(T1, points)
+
+ T2 = align_principle_axes(points)
+ camtoworlds = transform_cameras(T2, camtoworlds)
+ points = transform_points(T2, points)
+
+ transform = T2 @ T1
+ else:
+ transform = np.eye(4)
+
+ self.image_names = image_names # List[str], (num_images,)
+ self.image_paths = image_paths # List[str], (num_images,)
+ self.camtoworlds = camtoworlds # np.ndarray, (num_images, 4, 4)
+ self.camera_ids = camera_ids # List[int], (num_images,)
+ self.Ks_dict = Ks_dict # Dict of camera_id -> K
+ self.params_dict = params_dict # Dict of camera_id -> params
+ self.imsize_dict = imsize_dict # Dict of camera_id -> (width, height)
+ self.points = points # np.ndarray, (num_points, 3)
+ self.points_err = points_err # np.ndarray, (num_points,)
+ self.points_rgb = points_rgb # np.ndarray, (num_points, 3)
+ self.point_indices = point_indices # Dict[str, np.ndarray], image_name -> [M,]
+ self.transform = transform # np.ndarray, (4, 4)
+
+ # undistortion
+ self.mapx_dict = dict()
+ self.mapy_dict = dict()
+ self.roi_undist_dict = dict()
+ for camera_id in self.params_dict.keys():
+ params = self.params_dict[camera_id]
+ if len(params) == 0:
+ continue # no distortion
+ assert camera_id in self.Ks_dict, f"Missing K for camera {camera_id}"
+ assert (
+ camera_id in self.params_dict
+ ), f"Missing params for camera {camera_id}"
+ K = self.Ks_dict[camera_id]
+ width, height = self.imsize_dict[camera_id]
+ K_undist, roi_undist = cv2.getOptimalNewCameraMatrix(
+ K, params, (width, height), 0
+ )
+ mapx, mapy = cv2.initUndistortRectifyMap(
+ K,
+ params,
+ None,
+ K_undist,
+ (width, height),
+ cv2.CV_32FC1, # type: ignore
+ )
+ self.Ks_dict[camera_id] = K_undist
+ self.mapx_dict[camera_id] = mapx
+ self.mapy_dict[camera_id] = mapy
+ self.roi_undist_dict[camera_id] = roi_undist # type: ignore
+
+ # size of the scene measured by cameras
+ camera_locations = camtoworlds[:, :3, 3]
+ scene_center = np.mean(camera_locations, axis=0)
+ dists = np.linalg.norm(camera_locations - scene_center, axis=1)
+ self.scene_scale = np.max(dists)
+
+
+class ReconfusionParser(BaseParser):
+ def __init__(self, data_dir: str, normalize: bool = False):
+ super().__init__(data_dir, 1, normalize, test_every=None)
+
+ def get_num(p):
+ return p.split("_")[-1].removesuffix(".json")
+
+ splits_per_num_input_frames = {}
+ all_num_input_frames = [
+ int(get_num(p)) if get_num(p).isdigit() else get_num(p)
+ for p in sorted(glob(osp.join(data_dir, "train_test_split_*.json")))
+ ]
+ for num_input_frames in all_num_input_frames:
+ with open(
+ osp.join(
+ data_dir,
+ f"train_test_split_{num_input_frames}.json",
+ )
+ ) as f:
+ splits_per_num_input_frames[num_input_frames] = json.load(f)
+ self.splits_per_num_input_frames = splits_per_num_input_frames
+
+ with open(osp.join(data_dir, "transforms.json")) as f:
+ metadata = json.load(f)
+
+ image_names, image_paths, camtoworlds = [], [], []
+ for frame in metadata["frames"]:
+ if frame["file_path"] is None:
+ image_path = image_name = None
+ else:
+ image_path = osp.join(data_dir, frame["file_path"])
+ image_name = osp.basename(image_path)
+ image_paths.append(image_path)
+ image_names.append(image_name)
+ camtoworld = np.array(frame["transform_matrix"])
+ if "applied_transform" in metadata:
+ applied_transform = np.concatenate(
+ [metadata["applied_transform"], [[0, 0, 0, 1]]], axis=0
+ )
+ camtoworld = applied_transform @ camtoworld
+ camtoworlds.append(camtoworld)
+ camtoworlds = np.array(camtoworlds)
+ camtoworlds[:, :, [1, 2]] *= -1 # flip the camera y and z axes (OpenGL -> OpenCV convention)
+
+ # Normalize the world space.
+ if normalize:
+ T1 = similarity_from_cameras(camtoworlds)
+ camtoworlds = transform_cameras(T1, camtoworlds)
+ self.transform = T1
+ else:
+ self.transform = np.eye(4)
+
+ self.image_names = image_names
+ self.image_paths = image_paths
+ self.camtoworlds = camtoworlds
+ self.camera_ids = list(range(len(image_paths)))
+ self.Ks_dict = {
+ i: np.array(
+ [
+ [
+ metadata.get("fl_x", frame.get("fl_x", None)),
+ 0.0,
+ metadata.get("cx", frame.get("cx", None)),
+ ],
+ [
+ 0.0,
+ metadata.get("fl_y", frame.get("fl_y", None)),
+ metadata.get("cy", frame.get("cy", None)),
+ ],
+ [0.0, 0.0, 1.0],
+ ]
+ )
+ for i, frame in enumerate(metadata["frames"])
+ }
+ self.imsize_dict = {
+ i: (
+ metadata.get("w", frame.get("w", None)),
+ metadata.get("h", frame.get("h", None)),
+ )
+ for i, frame in enumerate(metadata["frames"])
+ }
+ # When num_input_frames is None, use all frames for both training and
+ # testing.
+ # self.splits_per_num_input_frames[None] = {
+ # "train_ids": list(range(len(image_paths))),
+ # "test_ids": list(range(len(image_paths))),
+ # }
+
+ # size of the scene measured by cameras
+ camera_locations = camtoworlds[:, :3, 3]
+ scene_center = np.mean(camera_locations, axis=0)
+ dists = np.linalg.norm(camera_locations - scene_center, axis=1)
+ self.scene_scale = np.max(dists)
+
+ self.bounds = None
+ if osp.exists(osp.join(data_dir, "bounds.npy")):
+ self.bounds = np.load(osp.join(data_dir, "bounds.npy"))
+ scaling = np.linalg.norm(self.transform[0, :3])
+ self.bounds = self.bounds / scaling
+
+
+class Dataset(torch.utils.data.Dataset):
+ """A simple dataset class."""
+
+ def __init__(
+ self,
+ parser: BaseParser,
+ split: str = "train",
+ num_input_frames: Optional[int] = None,
+ patch_size: Optional[int] = None,
+ load_depths: bool = False,
+ load_mono_disps: bool = False,
+ ):
+ self.parser = parser
+ self.split = split
+ self.num_input_frames = num_input_frames
+ self.patch_size = patch_size
+ self.load_depths = load_depths
+ self.load_mono_disps = load_mono_disps
+ if load_mono_disps:
+ assert isinstance(parser, DirectParser)
+ assert parser.mono_disps is not None
+ if isinstance(parser, ReconfusionParser):
+ ids_per_split = parser.splits_per_num_input_frames[num_input_frames]
+ self.indices = ids_per_split[
+ "train_ids" if split == "train" else "test_ids"
+ ]
+ else:
+ indices = np.arange(len(self.parser.image_names))
+ if split == "train":
+ self.indices = (
+ indices[indices % self.parser.test_every != 0]
+ if self.parser.test_every is not None
+ else indices
+ )
+ else:
+ self.indices = (
+ indices[indices % self.parser.test_every == 0]
+ if self.parser.test_every is not None
+ else indices
+ )
+
+ def __len__(self):
+ return len(self.indices)
+
+ def __getitem__(self, item: int) -> Dict[str, Any]:
+ index = self.indices[item]
+ if isinstance(self.parser, DirectParser):
+ image = self.parser.imgs[index]
+ else:
+ image = iio.imread(self.parser.image_paths[index])[..., :3]
+ camera_id = self.parser.camera_ids[index]
+ K = self.parser.Ks_dict[camera_id].copy() # undistorted K
+ params = self.parser.params_dict.get(camera_id, None)
+ camtoworlds = self.parser.camtoworlds[index]
+
+ x, y, w, h = 0, 0, image.shape[1], image.shape[0]
+ if params is not None and len(params) > 0:
+ # Images are distorted. Undistort them.
+ mapx, mapy = (
+ self.parser.mapx_dict[camera_id],
+ self.parser.mapy_dict[camera_id],
+ )
+ image = cv2.remap(image, mapx, mapy, cv2.INTER_LINEAR)
+ x, y, w, h = self.parser.roi_undist_dict[camera_id]
+ image = image[y : y + h, x : x + w]
+
+ if self.patch_size is not None:
+ # Random crop.
+ h, w = image.shape[:2]
+ x = np.random.randint(0, max(w - self.patch_size, 1))
+ y = np.random.randint(0, max(h - self.patch_size, 1))
+ image = image[y : y + self.patch_size, x : x + self.patch_size]
+ K[0, 2] -= x
+ K[1, 2] -= y
+
+ data = {
+ "K": torch.from_numpy(K).float(),
+ "camtoworld": torch.from_numpy(camtoworlds).float(),
+ "image": torch.from_numpy(image).float(),
+ "image_id": item, # the index of the image in the dataset
+ }
+
+ if self.load_depths:
+ # projected points to image plane to get depths
+ worldtocams = np.linalg.inv(camtoworlds)
+ image_name = self.parser.image_names[index]
+ point_indices = self.parser.point_indices[image_name]
+ points_world = self.parser.points[point_indices]
+ points_cam = (worldtocams[:3, :3] @ points_world.T + worldtocams[:3, 3:4]).T
+ points_proj = (K @ points_cam.T).T
+ points = points_proj[:, :2] / points_proj[:, 2:3] # (M, 2)
+ depths = points_cam[:, 2] # (M,)
+ if self.patch_size is not None:
+ points[:, 0] -= x
+ points[:, 1] -= y
+ # filter out points outside the image
+ selector = (
+ (points[:, 0] >= 0)
+ & (points[:, 0] < image.shape[1])
+ & (points[:, 1] >= 0)
+ & (points[:, 1] < image.shape[0])
+ & (depths > 0)
+ )
+ points = points[selector]
+ depths = depths[selector]
+ data["points"] = torch.from_numpy(points).float()
+ data["depths"] = torch.from_numpy(depths).float()
+ if self.load_mono_disps:
+ data["mono_disps"] = torch.from_numpy(self.parser.mono_disps[index]).float() # type: ignore
+
+ return data
+
+
+def get_parser(parser_type: str, **kwargs) -> BaseParser:
+    if parser_type == "colmap":
+        parser = COLMAPParser(**kwargs)
+    elif parser_type == "direct":
+        parser = DirectParser(**kwargs)
+    elif parser_type == "reconfusion":
+        parser = ReconfusionParser(**kwargs)
+    else:
+        raise ValueError(f"Unknown parser type: {parser_type}")
+    return parser
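For parsers other than `ReconfusionParser`, the `Dataset` class above derives its train/test split from `test_every`: every `test_every`-th image (starting at index 0) goes to the test split, and the rest go to the train split. When `test_every` is `None`, both splits contain all images. A minimal standalone sketch of that rule, in plain Python rather than NumPy so it runs without dependencies (the `split_indices` name is hypothetical):

```python
from typing import List, Optional, Tuple


def split_indices(num_images: int, test_every: Optional[int]) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) following the test_every rule."""
    indices = list(range(num_images))
    if test_every is None:
        # No hold-out: both splits see every image.
        return indices, indices
    train = [i for i in indices if i % test_every != 0]
    test = [i for i in indices if i % test_every == 0]
    return train, test


if __name__ == "__main__":
    train, test = split_indices(10, 8)
    print("train:", train)
    print("test:", test)
```

With 10 images and `test_every=8`, images 0 and 8 are held out for testing.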
diff --git a/seva/eval.py b/seva/eval.py
new file mode 100644
index 0000000000000000000000000000000000000000..48257f953c1465850dd76ff2c5bbb2a5b47e9d67
--- /dev/null
+++ b/seva/eval.py
@@ -0,0 +1,1990 @@
+import collections
+import json
+import math
+import os
+import re
+import threading
+from typing import List, Literal, Optional, Tuple, Union
+
+import gradio as gr
+from colorama import Fore, Style, init
+
+init(autoreset=True)
+
+import imageio.v3 as iio
+import numpy as np
+import torch
+import torch.nn.functional as F
+import torchvision.transforms.functional as TF
+from einops import repeat
+from PIL import Image
+from tqdm.auto import tqdm
+
+from seva.geometry import get_camera_dist, get_plucker_coordinates, to_hom_pose
+from seva.sampling import (
+ EulerEDMSampler,
+ MultiviewCFG,
+ MultiviewTemporalCFG,
+ VanillaCFG,
+)
+from seva.utils import seed_everything
+
+try:
+    # Nightly builds of torch carry a 'dev' tag in their version string.
+    version = torch.__version__
+    IS_TORCH_NIGHTLY = "dev" in version
+    if IS_TORCH_NIGHTLY:
+        torch._dynamo.config.cache_size_limit = 128  # type: ignore[assignment]
+        torch._dynamo.config.accumulated_cache_size_limit = 1024  # type: ignore[assignment]
+        torch._dynamo.config.force_parameter_static_shapes = False  # type: ignore[assignment]
+except Exception:
+    IS_TORCH_NIGHTLY = False
+
+
+def pad_indices(
+ input_indices: List[int],
+ test_indices: List[int],
+ T: int,
+ padding_mode: Literal["first", "last", "none"] = "last",
+):
+ assert padding_mode in ["last", "none"], "`first` padding is not supported yet."
+ if padding_mode == "last":
+ padded_indices = [
+ i for i in range(T) if i not in (input_indices + test_indices)
+ ]
+ else:
+ padded_indices = []
+ input_selects = list(range(len(input_indices)))
+ test_selects = list(range(len(test_indices)))
+ if max(input_indices) > max(test_indices):
+ # last elem from input
+ input_selects += [input_selects[-1]] * len(padded_indices)
+ input_indices = input_indices + padded_indices
+ sorted_inds = np.argsort(input_indices)
+ input_indices = [input_indices[ind] for ind in sorted_inds]
+ input_selects = [input_selects[ind] for ind in sorted_inds]
+ else:
+ # last elem from test
+ test_selects += [test_selects[-1]] * len(padded_indices)
+ test_indices = test_indices + padded_indices
+ sorted_inds = np.argsort(test_indices)
+ test_indices = [test_indices[ind] for ind in sorted_inds]
+ test_selects = [test_selects[ind] for ind in sorted_inds]
+
+ if padding_mode == "last":
+ input_maps = np.array([-1] * T)
+ test_maps = np.array([-1] * T)
+ else:
+ input_maps = np.array([-1] * (len(input_indices) + len(test_indices)))
+ test_maps = np.array([-1] * (len(input_indices) + len(test_indices)))
+ input_maps[input_indices] = input_selects
+ test_maps[test_indices] = test_selects
+ return input_indices, test_indices, input_maps, test_maps
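Review note: the sketch below restates, in plain Python, what the `last` padding mode of `pad_indices` appears to compute for the two maps. `pad_last` is a hypothetical helper written for illustration only (it is not part of `seva`), and it assumes only the maps are of interest, so it skips the NumPy sorting of the index lists.

```python
def pad_last(input_indices, test_indices, T):
    """Hypothetical re-statement of the `last` padding rule (maps only)."""
    used = set(input_indices) | set(test_indices)
    free = [i for i in range(T) if i not in used]  # slots nobody claimed

    input_maps = [-1] * T
    test_maps = [-1] * T
    for sel, idx in enumerate(input_indices):
        input_maps[idx] = sel
    for sel, idx in enumerate(test_indices):
        test_maps[idx] = sel

    # Free slots repeat the last entry of whichever group reaches the larger index.
    if max(input_indices) > max(test_indices):
        for idx in free:
            input_maps[idx] = len(input_indices) - 1
    else:
        for idx in free:
            test_maps[idx] = len(test_indices) - 1
    return input_maps, test_maps


input_maps, test_maps = pad_last([0], [1, 2], T=5)
print(input_maps)  # [0, -1, -1, -1, -1]
print(test_maps)   # [-1, 0, 1, 1, 1]
```

Every slot ends up owned by exactly one of the two maps, which is the property `assemble` later asserts with `np.logical_xor`.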
+
+
+def assemble(
+ input,
+ test,
+ input_maps,
+ test_maps,
+):
+ T = len(input_maps)
+ assembled = torch.zeros_like(test[-1:]).repeat_interleave(T, dim=0)
+ assembled[input_maps != -1] = input[input_maps[input_maps != -1]]
+ assembled[test_maps != -1] = test[test_maps[test_maps != -1]]
+ assert np.logical_xor(input_maps != -1, test_maps != -1).all()
+ return assembled
+
+
+def get_resizing_factor(
+ target_shape: Tuple[int, int], # H, W
+ current_shape: Tuple[int, int], # H, W
+ cover_target: bool = True,
+ # If True, the output shape will fully cover the target shape.
+ # If False, the target shape will fully cover the output shape.
+) -> float:
+ r_bound = target_shape[1] / target_shape[0]
+ aspect_r = current_shape[1] / current_shape[0]
+ if r_bound >= 1.0:
+ if cover_target:
+ if aspect_r >= r_bound:
+ factor = min(target_shape) / min(current_shape)
+ elif aspect_r < 1.0:
+ factor = max(target_shape) / min(current_shape)
+ else:
+ factor = max(target_shape) / max(current_shape)
+ else:
+ if aspect_r >= r_bound:
+ factor = max(target_shape) / max(current_shape)
+ elif aspect_r < 1.0:
+ factor = min(target_shape) / max(current_shape)
+ else:
+ factor = min(target_shape) / min(current_shape)
+ else:
+ if cover_target:
+ if aspect_r <= r_bound:
+ factor = min(target_shape) / min(current_shape)
+ elif aspect_r > 1.0:
+ factor = max(target_shape) / min(current_shape)
+ else:
+ factor = max(target_shape) / max(current_shape)
+ else:
+ if aspect_r <= r_bound:
+ factor = max(target_shape) / max(current_shape)
+ elif aspect_r > 1.0:
+ factor = min(target_shape) / max(current_shape)
+ else:
+ factor = min(target_shape) / min(current_shape)
+ return factor
+
+
+def get_unique_embedder_keys_from_conditioner(conditioner):
+ keys = [x.input_key for x in conditioner.embedders if x.input_key is not None]
+ keys = [item for sublist in keys for item in sublist] # Flatten list
+ return set(keys)
+
+
+def get_wh_with_fixed_shortest_side(w, h, size):
+ # If size is None or not positive, return the original (w, h).
+ if size is None or size <= 0:
+ return w, h
+ if w < h:
+ new_w = size
+ new_h = int(size * h / w)
+ else:
+ new_h = size
+ new_w = int(size * w / h)
+ return new_w, new_h
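Review note: a small worked example of the shortest-side resize arithmetic, combined with the stride rounding that `load_img_and_K` applies afterwards. Both functions are standalone copies written for illustration; `round_to_stride` is a hypothetical name for the inline `math.floor(x / size_stride + 0.5) * size_stride` expression.

```python
import math


def wh_with_fixed_shortest_side(w, h, size):
    # Mirrors the helper above: keep aspect ratio, pin the shorter side to `size`.
    if size is None or size <= 0:
        return w, h
    if w < h:
        return size, int(size * h / w)
    return int(size * w / h), size


def round_to_stride(x, stride):
    # Round to the nearest multiple of `stride` (halves round up).
    return math.floor(x / stride + 0.5) * stride


print(wh_with_fixed_shortest_side(1920, 1080, 576))  # (1024, 576)
print(wh_with_fixed_shortest_side(1080, 1920, 576))  # (576, 1024)
print(round_to_stride(500, 64))                      # 512
```

Note that `int(...)` truncates, so the longer side can come out one pixel short of the exact aspect ratio before the stride rounding is applied.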
+
+
+def load_img_and_K(
+ image_path_or_size: Union[str, torch.Size],
+ size: Optional[Union[int, Tuple[int, int]]],
+ scale: float = 1.0,
+ center: Tuple[float, float] = (0.5, 0.5),
+ K: torch.Tensor | None = None,
+ size_stride: int = 1,
+ center_crop: bool = False,
+ image_as_tensor: bool = True,
+ context_rgb: np.ndarray | None = None,
+ device: str = "cuda",
+):
+ if isinstance(image_path_or_size, torch.Size):
+ image = Image.new("RGBA", image_path_or_size[::-1])
+ else:
+ image = Image.open(image_path_or_size).convert("RGBA")
+
+ w, h = image.size
+ if size is None:
+ size = (w, h)
+
+ image = np.array(image).astype(np.float32) / 255
+ if image.shape[-1] == 4:
+ rgb, alpha = image[:, :, :3], image[:, :, 3:]
+ if context_rgb is not None:
+ image = rgb * alpha + context_rgb * (1 - alpha)
+ else:
+ image = rgb * alpha + (1 - alpha)
+ image = image.transpose(2, 0, 1)
+ image = torch.from_numpy(image).to(dtype=torch.float32)
+ image = image.unsqueeze(0)
+
+ if isinstance(size, (tuple, list)):
+ # => if size is a tuple or list, we first rescale to fully cover the `size`
+ # area and then crop the `size` area from the rescaled image
+ W, H = size
+ else:
+ # => if size is int, we rescale the image to fit the shortest side to size
+ # => if size is None, no rescaling is applied
+ W, H = get_wh_with_fixed_shortest_side(w, h, size)
+ W, H = (
+ math.floor(W / size_stride + 0.5) * size_stride,
+ math.floor(H / size_stride + 0.5) * size_stride,
+ )
+
+ rfs = get_resizing_factor((math.floor(H * scale), math.floor(W * scale)), (h, w))
+ resize_size = rh, rw = [int(np.ceil(rfs * s)) for s in (h, w)]
+ image = torch.nn.functional.interpolate(
+ image, resize_size, mode="area", antialias=False
+ )
+ if scale < 1.0:
+ pw = math.ceil((W - resize_size[1]) * 0.5)
+ ph = math.ceil((H - resize_size[0]) * 0.5)
+ image = F.pad(image, (pw, pw, ph, ph), "constant", 1.0)
+
+ cy_center = int(center[1] * image.shape[-2])
+ cx_center = int(center[0] * image.shape[-1])
+ if center_crop:
+ side = min(H, W)
+ ct = max(0, cy_center - side // 2)
+ cl = max(0, cx_center - side // 2)
+ ct = min(ct, image.shape[-2] - side)
+ cl = min(cl, image.shape[-1] - side)
+ image = TF.crop(image, top=ct, left=cl, height=side, width=side)
+ else:
+ ct = max(0, cy_center - H // 2)
+ cl = max(0, cx_center - W // 2)
+ ct = min(ct, image.shape[-2] - H)
+ cl = min(cl, image.shape[-1] - W)
+ image = TF.crop(image, top=ct, left=cl, height=H, width=W)
+
+ if K is not None:
+ K = K.clone()
+ if torch.all(K[:2, -1] >= 0) and torch.all(K[:2, -1] <= 1):
+ K[:2] *= K.new_tensor([rw, rh])[:, None] # normalized K
+ else:
+ K[:2] *= K.new_tensor([rw / w, rh / h])[:, None] # unnormalized K
+ K[:2, 2] -= K.new_tensor([cl, ct])
+
+ if image_as_tensor:
+ # tensor of shape (1, 3, H, W) with values in [-1, 1]
+ image = image.to(device) * 2.0 - 1.0
+ else:
+ # PIL Image with values in [0, 255]
+ image = image.permute(0, 2, 3, 1).numpy()[0]
+ image = Image.fromarray((image * 255).astype(np.uint8))
+ return image, K
+
+
+def transform_img_and_K(
+ image: torch.Tensor,
+ size: Union[int, Tuple[int, int]],
+ scale: float = 1.0,
+ center: Tuple[float, float] = (0.5, 0.5),
+ K: torch.Tensor | None = None,
+ size_stride: int = 1,
+ mode: str = "crop",
+):
+ assert mode in [
+ "crop",
+ "pad",
+ "stretch",
+ ], f"mode should be one of ['crop', 'pad', 'stretch'], got {mode}"
+
+ h, w = image.shape[-2:]
+ if isinstance(size, (tuple, list)):
+ # => if size is a tuple or list, we first rescale to fully cover the `size`
+ # area and then crop the `size` area from the rescaled image
+ W, H = size
+ else:
+ # => if size is int, we rescale the image to fit the shortest side to size
+ # => if size is None, no rescaling is applied
+ W, H = get_wh_with_fixed_shortest_side(w, h, size)
+ W, H = (
+ math.floor(W / size_stride + 0.5) * size_stride,
+ math.floor(H / size_stride + 0.5) * size_stride,
+ )
+
+ if mode == "stretch":
+ rh, rw = H, W
+ else:
+ rfs = get_resizing_factor(
+ (H, W),
+ (h, w),
+ cover_target=mode != "pad",
+ )
+ (rh, rw) = [int(np.ceil(rfs * s)) for s in (h, w)]
+
+ rh, rw = int(rh / scale), int(rw / scale)
+ image = torch.nn.functional.interpolate(
+ image, (rh, rw), mode="area", antialias=False
+ )
+
+ cy_center = int(center[1] * image.shape[-2])
+ cx_center = int(center[0] * image.shape[-1])
+ if mode != "pad":
+ ct = max(0, cy_center - H // 2)
+ cl = max(0, cx_center - W // 2)
+ ct = min(ct, image.shape[-2] - H)
+ cl = min(cl, image.shape[-1] - W)
+ image = TF.crop(image, top=ct, left=cl, height=H, width=W)
+ pl, pt = 0, 0
+ else:
+ pt = max(0, H // 2 - cy_center)
+ pl = max(0, W // 2 - cx_center)
+ pb = max(0, H - pt - image.shape[-2])
+ pr = max(0, W - pl - image.shape[-1])
+ image = TF.pad(
+ image,
+ [pl, pt, pr, pb],
+ )
+ cl, ct = 0, 0
+
+ if K is not None:
+ K = K.clone()
+ # K[:, :2, 2] += K.new_tensor([pl, pt])
+ if torch.all(K[:, :2, -1] >= 0) and torch.all(K[:, :2, -1] <= 1):
+ K[:, :2] *= K.new_tensor([rw, rh])[None, :, None] # normalized K
+ else:
+ K[:, :2] *= K.new_tensor([rw / w, rh / h])[None, :, None] # unnormalized K
+ K[:, :2, 2] += K.new_tensor([pl - cl, pt - ct])
+
+ return image, K
+
+
+lowvram_mode = False
+
+
+def set_lowvram_mode(mode):
+ global lowvram_mode
+ lowvram_mode = mode
+
+
+def load_model(model, device: str = "cuda"):
+ model.to(device)
+
+
+def unload_model(model):
+ global lowvram_mode
+ if lowvram_mode:
+ model.cpu()
+ torch.cuda.empty_cache()
+
+
+def infer_prior_stats(
+ T,
+ num_input_frames,
+ num_total_frames,
+ version_dict,
+):
+ options = version_dict["options"]
+ chunk_strategy = options.get("chunk_strategy", "nearest")
+ T_first_pass = T[0] if isinstance(T, (list, tuple)) else T
+ T_second_pass = T[1] if isinstance(T, (list, tuple)) else T
+ # get traj_prior_c2ws for 2-pass sampling
+ if chunk_strategy.startswith("interp"):
+ # Start and end have already taken up two slots
+ # +1 means we need X + 1 prior frames to bound X forward passes for all test frames
+
+ # Tuning up `num_prior_frames_ratio` is helpful when you observe sudden jumps in the
+ # generated frames due to insufficient prior frames. This option is effective for
+ # complicated trajectories and when the `interp` strategy is used (usually the
+ # semi-dense-view regime). Recommended range is [1.0 (default), 1.5].
+ if num_input_frames >= options.get("num_input_semi_dense", 9):
+ num_prior_frames = (
+ math.ceil(
+ num_total_frames
+ / (T_second_pass - 2)
+ * options.get("num_prior_frames_ratio", 1.0)
+ )
+ + 1
+ )
+
+ if num_prior_frames + num_input_frames < T_first_pass:
+ num_prior_frames = T_first_pass - num_input_frames
+
+ num_prior_frames = max(
+ num_prior_frames,
+ options.get("num_prior_frames", 0),
+ )
+
+ T_first_pass = num_prior_frames + num_input_frames
+
+ if "gt" in chunk_strategy:
+ T_second_pass = T_second_pass + num_input_frames
+
+ # Dynamically update context window length.
+ version_dict["T"] = [T_first_pass, T_second_pass]
+
+ else:
+ num_prior_frames = (
+ math.ceil(
+ num_total_frames
+ / (
+ T_second_pass
+ - 2
+ - (num_input_frames if "gt" in chunk_strategy else 0)
+ )
+ * options.get("num_prior_frames_ratio", 1.0)
+ )
+ + 1
+ )
+
+ if num_prior_frames + num_input_frames < T_first_pass:
+ num_prior_frames = T_first_pass - num_input_frames
+
+ num_prior_frames = max(
+ num_prior_frames,
+ options.get("num_prior_frames", 0),
+ )
+ else:
+ num_prior_frames = max(
+ T_first_pass - num_input_frames,
+ options.get("num_prior_frames", 0),
+ )
+
+ if num_input_frames >= options.get("num_input_semi_dense", 9):
+ T_first_pass = num_prior_frames + num_input_frames
+
+ # Dynamically update context window length.
+ version_dict["T"] = [T_first_pass, T_second_pass]
+
+ return num_prior_frames
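Review note: the prior-frame count for the `interp` strategy is easier to check with numbers. The sketch below isolates that arithmetic for the case where `gt` is not part of the chunking strategy; `interp_prior_frames` and its keyword names are hypothetical, and the real function additionally rewrites `version_dict["T"]`.

```python
import math


def interp_prior_frames(
    num_total_frames,
    num_input_frames,
    T_first_pass,
    T_second_pass,
    ratio=1.0,      # stands in for options["num_prior_frames_ratio"]
    min_prior=0,    # stands in for options["num_prior_frames"]
):
    # Each second-pass forward holds two anchors plus up to T_second_pass - 2
    # test frames, so X forwards need X + 1 anchors.
    n = math.ceil(num_total_frames / (T_second_pass - 2) * ratio) + 1
    # Never leave the first-pass window partially empty.
    if n + num_input_frames < T_first_pass:
        n = T_first_pass - num_input_frames
    return max(n, min_prior)


print(interp_prior_frames(80, 32, 21, 21))             # 6
print(interp_prior_frames(80, 32, 21, 21, ratio=1.5))  # 8
print(interp_prior_frames(10, 9, 21, 21))              # 12
```

The last call shows the floor taking over: only 2 anchors are strictly needed, but the first pass is filled up to `T_first_pass - num_input_frames`.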
+
+
+def infer_prior_inds(
+ c2ws,
+ num_prior_frames,
+ input_frame_indices,
+ options,
+):
+ chunk_strategy = options.get("chunk_strategy", "nearest")
+ if chunk_strategy.startswith("interp"):
+ prior_frame_indices = np.array(
+ [i for i in range(c2ws.shape[0]) if i not in input_frame_indices]
+ )
+ prior_frame_indices = prior_frame_indices[
+ np.ceil(
+ np.linspace(
+ 0, prior_frame_indices.shape[0] - 1, num_prior_frames, endpoint=True
+ )
+ ).astype(int)
+ ] # using ceil here is safer for corner cases
+ else:
+ prior_frame_indices = []
+ while len(prior_frame_indices) < num_prior_frames:
+ closest_distance = np.abs(
+ np.arange(c2ws.shape[0])[None]
+ - np.concatenate(
+ [np.array(input_frame_indices), np.array(prior_frame_indices)]
+ )[:, None]
+ ).min(0)
+ prior_frame_indices.append(np.argsort(closest_distance)[-1])
+ return np.sort(prior_frame_indices)
+
+
+def compute_relative_inds(
+ source_inds,
+ target_inds,
+):
+ assert len(source_inds) > 2
+ # compute relative indices of target_inds within source_inds
+ relative_inds = []
+ for ind in target_inds:
+ if ind in source_inds:
+ relative_ind = int(np.where(source_inds == ind)[0][0])
+ elif ind < source_inds[0]:
+ # extrapolate
+ relative_ind = -((source_inds[0] - ind) / (source_inds[1] - source_inds[0]))
+ elif ind > source_inds[-1]:
+ # extrapolate
+ relative_ind = len(source_inds) + (
+ (ind - source_inds[-1]) / (source_inds[-1] - source_inds[-2])
+ )
+ else:
+ # interpolate
+ lower_inds = source_inds[source_inds < ind]
+ upper_inds = source_inds[source_inds > ind]
+ if len(lower_inds) > 0 and len(upper_inds) > 0:
+ lower_ind = lower_inds[-1]
+ upper_ind = upper_inds[0]
+ relative_lower_ind = int(np.where(source_inds == lower_ind)[0][0])
+ relative_upper_ind = int(np.where(source_inds == upper_ind)[0][0])
+ relative_ind = relative_lower_ind + (ind - lower_ind) / (
+ upper_ind - lower_ind
+ ) * (relative_upper_ind - relative_lower_ind)
+ else:
+ # Out of range: record NaN as a placeholder
+ relative_ind = float("nan")
+ relative_inds.append(relative_ind)
+ return relative_inds
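Review note: the interior (interpolation) case of `compute_relative_inds` can be sketched with the standard library alone. `relative_index` is a hypothetical helper that assumes `source_inds` is a sorted list of distinct integers; it deliberately leaves out the extrapolation branches.

```python
import bisect


def relative_index(source_inds, ind):
    """Fractional position of `ind` within sorted `source_inds` (no extrapolation)."""
    pos = bisect.bisect_left(source_inds, ind)
    if pos < len(source_inds) and source_inds[pos] == ind:
        return float(pos)  # exact hit: the position itself
    if pos == 0 or pos == len(source_inds):
        raise ValueError("extrapolation is not covered by this sketch")
    lower, upper = source_inds[pos - 1], source_inds[pos]
    # Linear interpolation between the two neighbouring positions.
    return (pos - 1) + (ind - lower) / (upper - lower)


print([relative_index([0, 10, 20], i) for i in (0, 5, 10, 15)])
# [0.0, 0.5, 1.0, 1.5]
```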
+
+
+def find_nearest_source_inds(
+ source_c2ws,
+ target_c2ws,
+ nearest_num=1,
+ mode="translation",
+):
+ dists = get_camera_dist(source_c2ws, target_c2ws, mode=mode).cpu().numpy()
+ sorted_inds = np.argsort(dists, axis=0).T
+ return sorted_inds[:, :nearest_num]
+
+
+def chunk_input_and_test(
+ T,
+ input_c2ws,
+ test_c2ws,
+ input_ords, # orders
+ test_ords, # orders
+ options,
+ task: str = "img2img",
+ chunk_strategy: str = "gt",
+ gt_input_inds: list = [],
+):
+ M, N = input_c2ws.shape[0], test_c2ws.shape[0]
+
+ chunks = []
+ if chunk_strategy.startswith("gt"):
+ assert len(gt_input_inds) < T, (
+ f"Number of gt input frames {len(gt_input_inds)} should be "
+ f"less than {T} when `gt` chunking strategy is used."
+ )
+ assert (
+ list(range(M)) == gt_input_inds
+ ), "All input_c2ws should be gt when `gt` chunking strategy is used."
+
+ # LEGACY CHUNKING STRATEGY
+ # num_test_per_chunk = T - len(gt_input_inds)
+ # test_inds_per_chunk = [i for i in range(T) if i not in gt_input_inds]
+ # for i in range(0, test_c2ws.shape[0], num_test_per_chunk):
+ # chunk = ["NULL"] * T
+ # for j, k in enumerate(gt_input_inds):
+ # chunk[k] = f"!{j:03d}"
+ # for j, k in enumerate(
+ # test_inds_per_chunk[: test_c2ws[i : i + num_test_per_chunk].shape[0]]
+ # ):
+ # chunk[k] = f">{i + j:03d}"
+ # chunks.append(chunk)
+
+ num_test_seen = 0
+ while num_test_seen < N:
+ chunk = [f"!{i:03d}" for i in gt_input_inds]
+ if chunk_strategy != "gt" and num_test_seen > 0:
+ pseudo_num_ratio = options.get("pseudo_num_ratio", 0.33)
+ if (N - num_test_seen) >= math.floor(
+ (T - len(gt_input_inds)) * pseudo_num_ratio
+ ):
+ pseudo_num = math.ceil((T - len(gt_input_inds)) * pseudo_num_ratio)
+ else:
+ pseudo_num = (T - len(gt_input_inds)) - (N - num_test_seen)
+ pseudo_num = min(pseudo_num, options.get("pseudo_num_max", 10000))
+
+ if "ltr" in chunk_strategy:
+ chunk.extend(
+ [
+ f"!{i + len(gt_input_inds):03d}"
+ for i in range(num_test_seen - pseudo_num, num_test_seen)
+ ]
+ )
+ elif "nearest" in chunk_strategy:
+ source_inds = np.concatenate(
+ [
+ find_nearest_source_inds(
+ test_c2ws[:num_test_seen],
+ test_c2ws[num_test_seen:],
+ nearest_num=1, # pseudo_num,
+ mode="rotation",
+ ),
+ find_nearest_source_inds(
+ test_c2ws[:num_test_seen],
+ test_c2ws[num_test_seen:],
+ nearest_num=1, # pseudo_num,
+ mode="translation",
+ ),
+ ],
+ axis=1,
+ )
+ ####### [HACK ALERT] keep running until pseudo num is stabilized #######
+ temp_pseudo_num = pseudo_num
+ while True:
+ nearest_source_inds = np.concatenate(
+ [
+ np.sort(
+ [
+ ind
+ for (ind, _) in collections.Counter(
+ [
+ item
+ for item in source_inds[
+ : T
+ - len(gt_input_inds)
+ - temp_pseudo_num
+ ]
+ .flatten()
+ .tolist()
+ if item
+ != (
+ num_test_seen - 1
+ ) # exclude the last one here
+ ]
+ ).most_common(pseudo_num - 1)
+ ],
+ ).astype(int),
+ [num_test_seen - 1], # always keep the last one
+ ]
+ )
+ if len(nearest_source_inds) >= temp_pseudo_num:
+ break # stabilized
+ else:
+ temp_pseudo_num = len(nearest_source_inds)
+ pseudo_num = len(nearest_source_inds)
+ ########################################################################
+ chunk.extend(
+ [f"!{i + len(gt_input_inds):03d}" for i in nearest_source_inds]
+ )
+ else:
+ raise NotImplementedError(
+ f"Chunking strategy {chunk_strategy} for the first pass is not implemented."
+ )
+
+ chunk.extend(
+ [
+ f">{i:03d}"
+ for i in range(
+ num_test_seen,
+ min(num_test_seen + T - len(gt_input_inds) - pseudo_num, N),
+ )
+ ]
+ )
+ else:
+ chunk.extend(
+ [
+ f">{i:03d}"
+ for i in range(
+ num_test_seen,
+ min(num_test_seen + T - len(gt_input_inds), N),
+ )
+ ]
+ )
+
+ num_test_seen += sum([1 for c in chunk if c.startswith(">")])
+ if len(chunk) < T:
+ chunk.extend(["NULL"] * (T - len(chunk)))
+ chunks.append(chunk)
+
+ elif chunk_strategy.startswith("nearest"):
+ input_imgs = np.array([f"!{i:03d}" for i in range(M)])
+ test_imgs = np.array([f">{i:03d}" for i in range(N)])
+
+ match = re.match(r"^nearest-(\d+)$", chunk_strategy)
+ if match:
+ nearest_num = int(match.group(1))
+ assert (
+ nearest_num < T
+ ), f"Nearest number of {nearest_num} should be less than {T}."
+ source_inds = find_nearest_source_inds(
+ input_c2ws,
+ test_c2ws,
+ nearest_num=nearest_num,
+ mode="translation", # during the second pass, considering translation only is enough
+ )
+
+ for i in range(0, N, T - nearest_num):
+ nearest_source_inds = np.sort(
+ [
+ ind
+ for (ind, _) in collections.Counter(
+ source_inds[i : i + T - nearest_num].flatten().tolist()
+ ).most_common(nearest_num)
+ ]
+ )
+ chunk = (
+ input_imgs[nearest_source_inds].tolist()
+ + test_imgs[i : i + T - nearest_num].tolist()
+ )
+ chunks.append(chunk + ["NULL"] * (T - len(chunk)))
+
+ else:
+ # do not always condition on gt cond frames
+ if "gt" not in chunk_strategy:
+ gt_input_inds = []
+
+ source_inds = find_nearest_source_inds(
+ input_c2ws,
+ test_c2ws,
+ nearest_num=1,
+ mode="translation", # during the second pass, considering translation only is enough
+ )[:, 0]
+
+ test_inds_per_input = {}
+ for test_idx, input_idx in enumerate(source_inds):
+ if input_idx not in test_inds_per_input:
+ test_inds_per_input[input_idx] = []
+ test_inds_per_input[input_idx].append(test_idx)
+
+ num_test_seen = 0
+ chunk = input_imgs[gt_input_inds].tolist()
+ candidate_input_inds = sorted(list(test_inds_per_input.keys()))
+
+ while num_test_seen < N:
+ input_idx = candidate_input_inds[0]
+ test_inds = test_inds_per_input[input_idx]
+ input_is_cond = input_idx in gt_input_inds
+ prefix_inds = [] if input_is_cond else [input_idx]
+
+ if len(chunk) == T - len(prefix_inds) or not candidate_input_inds:
+ if chunk:
+ chunk += ["NULL"] * (T - len(chunk))
+ chunks.append(chunk)
+ chunk = input_imgs[gt_input_inds].tolist()
+ if num_test_seen >= N:
+ break
+ continue
+
+ candidate_chunk = (
+ input_imgs[prefix_inds].tolist() + test_imgs[test_inds].tolist()
+ )
+
+ space_left = T - len(chunk)
+ if len(candidate_chunk) <= space_left:
+ chunk.extend(candidate_chunk)
+ num_test_seen += len(test_inds)
+ candidate_input_inds.pop(0)
+ else:
+ chunk.extend(candidate_chunk[:space_left])
+ num_input_idx = 0 if input_is_cond else 1
+ num_test_seen += space_left - num_input_idx
+ test_inds_per_input[input_idx] = test_inds[
+ space_left - num_input_idx :
+ ]
+
+ if len(chunk) == T:
+ chunks.append(chunk)
+ chunk = input_imgs[gt_input_inds].tolist()
+
+ if chunk and chunk != input_imgs[gt_input_inds].tolist():
+ chunks.append(chunk + ["NULL"] * (T - len(chunk)))
+
+ elif chunk_strategy.startswith("interp"):
+ # `interp` chunk requires ordering info
+ assert input_ords is not None and test_ords is not None, (
+ "When using `interp` chunking strategy, ordering of input "
+ "and test frames should be provided."
+ )
+
+ # if chunk_strategy is `interp*` and task is `img2trajvid*`, we will not
+ # use input views since their order info within target views is unknown
+ if "img2trajvid" in task:
+ assert (
+ list(range(len(gt_input_inds))) == gt_input_inds
+ ), "`img2trajvid` task should put `gt_input_inds` at the start."
+ input_c2ws = input_c2ws[
+ [ind for ind in range(M) if ind not in gt_input_inds]
+ ]
+ input_ords = [
+ input_ords[ind] for ind in range(M) if ind not in gt_input_inds
+ ]
+ M = input_c2ws.shape[0]
+
+ input_ords = [0] + input_ords # this is a hack accounting for test views
+ # before the first input view
+ input_ords[-1] += 0.01 # this is a hack ensuring last test stop is included
+ # in the last forward when input_ords[-1] == test_ords[-1]
+ input_ords = np.array(input_ords)[:, None]
+ input_ords_ = np.concatenate([input_ords[1:], np.full((1, 1), np.inf)])
+ test_ords = np.array(test_ords)[None]
+
+ in_stop_ranges = np.logical_and(
+ np.repeat(input_ords, N, axis=1) <= np.repeat(test_ords, M + 1, axis=0),
+ np.repeat(input_ords_, N, axis=1) > np.repeat(test_ords, M + 1, axis=0),
+ ) # (M + 1, N)
+ assert (in_stop_ranges.sum(1) <= T - 2).all(), (
+ "More input frames need to be sampled during the first pass to ensure "
+ f"#test frames during each forward in the second pass will not exceed {T - 2}."
+ )
+ if input_ords[1, 0] <= test_ords[0, 0]:
+ assert not in_stop_ranges[0].any()
+ if input_ords[-1, 0] >= test_ords[0, -1]:
+ assert not in_stop_ranges[-1].any()
+
+ gt_chunk = (
+ [f"!{i:03d}" for i in gt_input_inds] if "gt" in chunk_strategy else []
+ )
+ chunk = gt_chunk + []
+ # any test views before the first input views
+ if in_stop_ranges[0].any():
+ for j, in_range in enumerate(in_stop_ranges[0]):
+ if in_range:
+ chunk.append(f">{j:03d}")
+ in_stop_ranges = in_stop_ranges[1:]
+
+ i = 0
+ base_i = len(gt_input_inds) if "img2trajvid" in task else 0
+ chunk.append(f"!{i + base_i:03d}")
+ while i < len(in_stop_ranges):
+ in_stop_range = in_stop_ranges[i]
+ if not in_stop_range.any():
+ i += 1
+ continue
+
+ input_left = i + 1 < M
+ space_left = T - len(chunk)
+ if sum(in_stop_range) + input_left <= space_left:
+ for j, in_range in enumerate(in_stop_range):
+ if in_range:
+ chunk.append(f">{j:03d}")
+ i += 1
+ if input_left:
+ chunk.append(f"!{i + base_i:03d}")
+
+ else:
+ chunk += ["NULL"] * space_left
+ chunks.append(chunk)
+ chunk = gt_chunk + [f"!{i + base_i:03d}"]
+
+ if len(chunk) > 1:
+ chunk += ["NULL"] * (T - len(chunk))
+ chunks.append(chunk)
+
+ else:
+ raise NotImplementedError
+
+ (
+ input_inds_per_chunk,
+ input_sels_per_chunk,
+ test_inds_per_chunk,
+ test_sels_per_chunk,
+ ) = (
+ [],
+ [],
+ [],
+ [],
+ )
+ for chunk in chunks:
+ input_inds = [
+ int(img.removeprefix("!")) for img in chunk if img.startswith("!")
+ ]
+ input_sels = [chunk.index(img) for img in chunk if img.startswith("!")]
+ test_inds = [int(img.removeprefix(">")) for img in chunk if img.startswith(">")]
+ test_sels = [chunk.index(img) for img in chunk if img.startswith(">")]
+ input_inds_per_chunk.append(input_inds)
+ input_sels_per_chunk.append(input_sels)
+ test_inds_per_chunk.append(test_inds)
+ test_sels_per_chunk.append(test_sels)
+
+ if options.get("sampler_verbose", True):
+
+ def colorize(item):
+ if item.startswith("!"):
+ return f"{Fore.RED}{item}{Style.RESET_ALL}" # Red for items starting with '!'
+ elif item.startswith(">"):
+ return f"{Fore.GREEN}{item}{Style.RESET_ALL}" # Green for items starting with '>'
+ return item # Default color if neither '!' nor '>'
+
+ print("\nchunks:")
+ for chunk in chunks:
+ print(", ".join(colorize(item) for item in chunk))
+
+ return (
+ chunks,
+ input_inds_per_chunk, # ordering of input in raw sequence
+ input_sels_per_chunk, # ordering of input in one-forward sequence of length T
+ test_inds_per_chunk, # ordering of test in raw sequence
+ test_sels_per_chunk, # ordering of test in one-forward sequence of length T
+ )
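Review note: the chunk lists use a small token convention (`!NNN` for an input frame, `>NNN` for a test frame, `NULL` for padding). The sketch below shows how one chunk decodes into the four per-chunk lists returned above. `parse_chunk` is a hypothetical helper; it uses `enumerate` for slot positions, which matches `chunk.index(...)` as long as the non-`NULL` tokens in a chunk are unique.

```python
def parse_chunk(chunk):
    """Split one chunk into (input_inds, input_sels, test_inds, test_sels)."""
    input_inds, input_sels, test_inds, test_sels = [], [], [], []
    for slot, token in enumerate(chunk):
        if token.startswith("!"):
            input_inds.append(int(token[1:]))  # index in the raw input sequence
            input_sels.append(slot)            # slot within the length-T window
        elif token.startswith(">"):
            test_inds.append(int(token[1:]))   # index in the raw test sequence
            test_sels.append(slot)
        # "NULL" padding slots are ignored
    return input_inds, input_sels, test_inds, test_sels


print(parse_chunk(["!000", ">000", ">001", "!001", "NULL"]))
# ([0, 1], [0, 3], [0, 1], [1, 2])
```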
+
+
+def is_k_in_dict(d, k):
+ return any(map(lambda x: x.startswith(k), d.keys()))
+
+
+def get_k_from_dict(d, k):
+ media_d = {}
+ for key, value in d.items():
+ if key == k:
+ return value
+ if key.startswith(k):
+ media = key.split("/")[-1]
+ if media == "raw":
+ return value
+ media_d[media] = value
+ if len(media_d) == 0:
+ return torch.tensor([])
+ assert (
+ len(media_d) == 1
+ ), f"multiple media found in {d} for key {k}: {media_d.keys()}"
+ return media_d[media]
+
+
+def update_kv_for_dict(d, k, v):
+ for key in d.keys():
+ if key.startswith(k):
+ d[key] = v
+ return d
+
+
+def extend_dict(ds, d):
+ for key in d.keys():
+ if key in ds:
+ ds[key] = torch.cat([ds[key], d[key]], 0)
+ else:
+ ds[key] = d[key]
+ return ds
+
+
+def replace_or_include_input_for_dict(
+ samples,
+ test_indices,
+ imgs,
+ c2w,
+ K,
+):
+ samples_new = {}
+ for sample, value in samples.items():
+ if "rgb" in sample:
+ imgs[test_indices] = (
+ value[test_indices] if value.shape[0] == imgs.shape[0] else value
+ ).to(device=imgs.device, dtype=imgs.dtype)
+ samples_new[sample] = imgs
+ elif "c2w" in sample:
+ c2w[test_indices] = (
+ value[test_indices] if value.shape[0] == c2w.shape[0] else value
+ ).to(device=c2w.device, dtype=c2w.dtype)
+ samples_new[sample] = c2w
+ elif "intrinsics" in sample:
+ K[test_indices] = (
+ value[test_indices] if value.shape[0] == K.shape[0] else value
+ ).to(device=K.device, dtype=K.dtype)
+ samples_new[sample] = K
+ else:
+ samples_new[sample] = value
+ return samples_new
+
+
+def decode_output(
+ samples,
+ T,
+ indices=None,
+):
+ # decode model output into a dict if it is not one already
+ if isinstance(samples, dict):
+ # model with postprocessor and outputs dict
+ for sample, value in samples.items():
+ if isinstance(value, torch.Tensor):
+ value = value.detach().cpu()
+ elif isinstance(value, np.ndarray):
+ value = torch.from_numpy(value)
+ else:
+ value = torch.tensor(value)
+
+ if indices is not None and value.shape[0] == T:
+ value = value[indices]
+ samples[sample] = value
+ else:
+ # model without postprocessor and outputs tensor (rgb)
+ samples = samples.detach().cpu()
+
+ if indices is not None and samples.shape[0] == T:
+ samples = samples[indices]
+ samples = {"samples-rgb/image": samples}
+
+ return samples
+
+
+def save_output(
+ samples,
+ save_path,
+ video_save_fps=2,
+):
+ os.makedirs(save_path, exist_ok=True)
+ for sample in samples:
+ media_type = "video"
+ if "/" in sample:
+ sample_, media_type = sample.split("/")
+ else:
+ sample_ = sample
+
+ value = samples[sample]
+ if isinstance(value, torch.Tensor):
+ value = value.detach().cpu()
+ elif isinstance(value, np.ndarray):
+ value = torch.from_numpy(value)
+ else:
+ value = torch.tensor(value)
+
+ if media_type == "image":
+ value = (value.permute(0, 2, 3, 1) + 1) / 2.0
+ value = (value * 255).clamp(0, 255).to(torch.uint8)
+ iio.imwrite(
+ os.path.join(save_path, f"{sample_}.mp4")
+ if sample_
+ else f"{save_path}.mp4",
+ value,
+ fps=video_save_fps,
+ macro_block_size=1,
+ ffmpeg_log_level="error",
+ )
+ os.makedirs(os.path.join(save_path, sample_), exist_ok=True)
+ for i, s in enumerate(value):
+ iio.imwrite(
+ os.path.join(save_path, sample_, f"{i:03d}.png"),
+ s,
+ )
+ elif media_type == "video":
+ value = (value.permute(0, 2, 3, 1) + 1) / 2.0
+ value = (value * 255).clamp(0, 255).to(torch.uint8)
+ iio.imwrite(
+ os.path.join(save_path, f"{sample_}.mp4"),
+ value,
+ fps=video_save_fps,
+ macro_block_size=1,
+ ffmpeg_log_level="error",
+ )
+ elif media_type == "raw":
+ torch.save(
+ value,
+ os.path.join(save_path, f"{sample_}.pt"),
+ )
+ else:
+ pass
+
+
+def create_transforms_simple(save_path, img_paths, img_whs, c2ws, Ks):
+ import os.path as osp
+
+ out_frames = []
+ for img_path, img_wh, c2w, K in zip(img_paths, img_whs, c2ws, Ks):
+ out_frame = {
+ "fl_x": K[0][0].item(),
+ "fl_y": K[1][1].item(),
+ "cx": K[0][2].item(),
+ "cy": K[1][2].item(),
+ "w": img_wh[0].item(),
+ "h": img_wh[1].item(),
+ "file_path": f"./{osp.relpath(img_path, start=save_path)}"
+ if img_path is not None
+ else None,
+ "transform_matrix": c2w.tolist(),
+ }
+ out_frames.append(out_frame)
+ out = {
+ # "camera_model": "PINHOLE",
+ "orientation_override": "none",
+ "frames": out_frames,
+ }
+ with open(osp.join(save_path, "transforms.json"), "w") as of:
+ json.dump(out, of, indent=5)
+
+
+class GradioTrackedSampler(EulerEDMSampler):
+ """
+ A thin wrapper around the EulerEDMSampler that allows tracking progress and
+ aborting sampling for the Gradio demo.
+ """
+
+ def __init__(self, abort_event: threading.Event, *args, **kwargs):
+ super().__init__(*args, **kwargs)
+ self.abort_event = abort_event
+
+ def __call__( # type: ignore
+ self,
+ denoiser,
+ x: torch.Tensor,
+ scale: float | torch.Tensor,
+ cond: dict,
+ uc: dict | None = None,
+ num_steps: int | None = None,
+ verbose: bool = True,
+ global_pbar: gr.Progress | None = None,
+ **guider_kwargs,
+ ) -> torch.Tensor | None:
+ uc = cond if uc is None else uc
+ x, s_in, sigmas, num_sigmas, cond, uc = self.prepare_sampling_loop(
+ x,
+ cond,
+ uc,
+ num_steps,
+ )
+ for i in self.get_sigma_gen(num_sigmas, verbose=verbose):
+ gamma = (
+ min(self.s_churn / (num_sigmas - 1), 2**0.5 - 1)
+ if self.s_tmin <= sigmas[i] <= self.s_tmax
+ else 0.0
+ )
+ x = self.sampler_step(
+ s_in * sigmas[i],
+ s_in * sigmas[i + 1],
+ denoiser,
+ x,
+ scale,
+ cond,
+ uc,
+ gamma,
+ **guider_kwargs,
+ )
+ # Allow tracking progress in gradio demo.
+ if global_pbar is not None:
+ global_pbar.update()
+ # Allow aborting sampling in gradio demo.
+ if self.abort_event.is_set():
+ return None
+ return x
+
+
+def create_samplers(
+ guider_types: int | list[int],
+ discretization,
+ num_frames: list[int] | None,
+ num_steps: int,
+ cfg_min: float = 1.0,
+ device: str | torch.device = "cuda",
+ abort_event: threading.Event | None = None,
+):
+ guider_mapping = {
+ 0: VanillaCFG,
+ 1: MultiviewCFG,
+ 2: MultiviewTemporalCFG,
+ }
+ samplers = []
+ if not isinstance(guider_types, (list, tuple)):
+ guider_types = [guider_types]
+ for i, guider_type in enumerate(guider_types):
+ if guider_type not in guider_mapping:
+ raise ValueError(
+ f"Invalid guider type {guider_type}. Must be one of {list(guider_mapping.keys())}"
+ )
+ guider_cls = guider_mapping[guider_type]
+ guider_args = ()
+ if guider_type > 0:
+ guider_args += (cfg_min,)
+ if guider_type == 2:
+ assert num_frames is not None
+ guider_args = (num_frames[i], cfg_min)
+ guider = guider_cls(*guider_args)
+
+ if abort_event is not None:
+ sampler = GradioTrackedSampler(
+ abort_event,
+ discretization=discretization,
+ guider=guider,
+ num_steps=num_steps,
+ s_churn=0.0,
+ s_tmin=0.0,
+ s_tmax=999.0,
+ s_noise=1.0,
+ verbose=True,
+ device=device,
+ )
+ else:
+ sampler = EulerEDMSampler(
+ discretization=discretization,
+ guider=guider,
+ num_steps=num_steps,
+ s_churn=0.0,
+ s_tmin=0.0,
+ s_tmax=999.0,
+ s_noise=1.0,
+ verbose=True,
+ device=device,
+ )
+ samplers.append(sampler)
+ return samplers
+
+
+def get_value_dict(
+ curr_imgs,
+ curr_imgs_clip,
+ curr_input_frame_indices,
+ curr_c2ws,
+ curr_Ks,
+ curr_input_camera_indices,
+ all_c2ws,
+ camera_scale,
+):
+ assert sorted(curr_input_camera_indices) == sorted(
+ range(len(curr_input_camera_indices))
+ )
+ H, W, T, F = curr_imgs.shape[-2], curr_imgs.shape[-1], len(curr_imgs), 8
+
+ value_dict = {}
+ value_dict["cond_frames_without_noise"] = curr_imgs_clip[curr_input_frame_indices]
+ value_dict["cond_frames"] = curr_imgs + 0.0 * torch.randn_like(curr_imgs)
+ value_dict["cond_frames_mask"] = torch.zeros(T, dtype=torch.bool)
+ value_dict["cond_frames_mask"][curr_input_frame_indices] = True
+ value_dict["cond_aug"] = 0.0
+
+ c2w = to_hom_pose(curr_c2ws.float())
+ w2c = torch.linalg.inv(c2w)
+
+ # camera centering
+ ref_c2ws = all_c2ws
+ camera_dist_2med = torch.norm(
+ ref_c2ws[:, :3, 3] - ref_c2ws[:, :3, 3].median(0, keepdim=True).values,
+ dim=-1,
+ )
+ valid_mask = camera_dist_2med <= torch.clamp(
+ torch.quantile(camera_dist_2med, 0.97) * 10,
+ max=1e6,
+ )
+ c2w[:, :3, 3] -= ref_c2ws[valid_mask, :3, 3].mean(0, keepdim=True)
+ w2c = torch.linalg.inv(c2w)
+
+ # camera normalization
+ camera_dists = c2w[:, :3, 3].clone()
+ translation_scaling_factor = (
+ camera_scale
+ if torch.isclose(
+ torch.norm(camera_dists[0]),
+ torch.zeros(1),
+ atol=1e-5,
+ ).any()
+ else (camera_scale / torch.norm(camera_dists[0]))
+ )
+ w2c[:, :3, 3] *= translation_scaling_factor
+ c2w[:, :3, 3] *= translation_scaling_factor
+ value_dict["plucker_coordinate"], _ = get_plucker_coordinates(
+ extrinsics_src=w2c[0],
+ extrinsics=w2c,
+ intrinsics=curr_Ks.float().clone(),
+ mode="plucker",
+ rel_zero_translation=True,
+ target_size=(H // F, W // F),
+ return_grid_cam=True,
+ )
+
+ value_dict["c2w"] = c2w
+ value_dict["K"] = curr_Ks
+ value_dict["camera_mask"] = torch.zeros(T, dtype=torch.bool)
+ value_dict["camera_mask"][curr_input_camera_indices] = True
+
+ return value_dict
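
Review note: the camera normalization step in `get_value_dict` rescales all translations so that the first camera ends up at distance `camera_scale` from the origin, unless that camera already sits at the origin. The sketch below is a dependency-free illustration of that rule only; the function name `translation_scaling_factor` is a hypothetical helper, not part of the diff, and plain Python lists stand in for tensors.

```python
import math


def translation_scaling_factor(first_cam_position, camera_scale=2.0, atol=1e-5):
    """Return the factor applied to every camera translation.

    If the first camera sits (numerically) at the origin, its distance cannot
    serve as a reference, so the raw `camera_scale` is returned instead.
    """
    dist = math.sqrt(sum(x * x for x in first_cam_position))
    if dist <= atol:
        return camera_scale
    return camera_scale / dist


# A first camera at distance 5 from the origin is pulled to distance 2.
position = [3.0, 4.0, 0.0]
factor = translation_scaling_factor(position, camera_scale=2.0)
scaled = [factor * x for x in position]
```

The same factor is applied to both `w2c` and `c2w` translations in the diff, which keeps the two consistent only because the rotation part is left untouched.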
+
+
+def do_sample(
+ model,
+ ae,
+ conditioner,
+ denoiser,
+ sampler,
+ value_dict,
+ H,
+ W,
+ C,
+ F,
+ T,
+ cfg,
+ encoding_t=1,
+ decoding_t=1,
+ verbose=True,
+ global_pbar=None,
+ **_,
+):
+ imgs = value_dict["cond_frames"].to("cuda")
+ input_masks = value_dict["cond_frames_mask"].to("cuda")
+ pluckers = value_dict["plucker_coordinate"].to("cuda")
+
+ num_samples = [1, T]
+ with torch.inference_mode(), torch.autocast("cuda"):
+ load_model(ae)
+ load_model(conditioner)
+ latents = torch.nn.functional.pad(
+ ae.encode(imgs[input_masks], encoding_t), (0, 0, 0, 0, 0, 1), value=1.0
+ )
+ c_crossattn = repeat(conditioner(imgs[input_masks]).mean(0), "d -> n 1 d", n=T)
+ uc_crossattn = torch.zeros_like(c_crossattn)
+ c_replace = latents.new_zeros(T, *latents.shape[1:])
+ c_replace[input_masks] = latents
+ uc_replace = torch.zeros_like(c_replace)
+ c_concat = torch.cat(
+ [
+ repeat(
+ input_masks,
+ "n -> n 1 h w",
+ h=pluckers.shape[2],
+ w=pluckers.shape[3],
+ ),
+ pluckers,
+ ],
+ 1,
+ )
+ uc_concat = torch.cat(
+ [pluckers.new_zeros(T, 1, *pluckers.shape[-2:]), pluckers], 1
+ )
+ c_dense_vector = pluckers
+ uc_dense_vector = c_dense_vector
+ c = {
+ "crossattn": c_crossattn,
+ "replace": c_replace,
+ "concat": c_concat,
+ "dense_vector": c_dense_vector,
+ }
+ uc = {
+ "crossattn": uc_crossattn,
+ "replace": uc_replace,
+ "concat": uc_concat,
+ "dense_vector": uc_dense_vector,
+ }
+ unload_model(ae)
+ unload_model(conditioner)
+
+ additional_model_inputs = {"num_frames": T}
+ additional_sampler_inputs = {
+ "c2w": value_dict["c2w"].to("cuda"),
+ "K": value_dict["K"].to("cuda"),
+ "input_frame_mask": value_dict["cond_frames_mask"].to("cuda"),
+ }
+ if global_pbar is not None:
+ additional_sampler_inputs["global_pbar"] = global_pbar
+
+ shape = (math.prod(num_samples), C, H // F, W // F)
+ randn = torch.randn(shape).to("cuda")
+
+ load_model(model)
+ samples_z = sampler(
+ lambda input, sigma, c: denoiser(
+ model,
+ input,
+ sigma,
+ c,
+ **additional_model_inputs,
+ ),
+ randn,
+ scale=cfg,
+ cond=c,
+ uc=uc,
+ verbose=verbose,
+ **additional_sampler_inputs,
+ )
+ if samples_z is None:
+ return
+ unload_model(model)
+
+ load_model(ae)
+ samples = ae.decode(samples_z, decoding_t)
+ unload_model(ae)
+
+ return samples
+
+
+def run_one_scene(
+ task,
+ version_dict,
+ model,
+ ae,
+ conditioner,
+ denoiser,
+ image_cond,
+ camera_cond,
+ save_path,
+ use_traj_prior,
+ traj_prior_Ks,
+ traj_prior_c2ws,
+ seed=23,
+ gradio=False,
+ abort_event=None,
+ first_pass_pbar=None,
+ second_pass_pbar=None,
+):
+ H, W, T, C, F, options = (
+ version_dict["H"],
+ version_dict["W"],
+ version_dict["T"],
+ version_dict["C"],
+ version_dict["f"],
+ version_dict["options"],
+ )
+
+ if isinstance(image_cond, str):
+ image_cond = {"img": [image_cond]}
+ imgs_clip, imgs, img_size = [], [], None
+ for i, (img, K) in enumerate(zip(image_cond["img"], camera_cond["K"])):
+ if isinstance(img, str) or img is None:
+ img, K = load_img_and_K(img or img_size, None, K=K, device="cpu") # type: ignore
+ img_size = img.shape[-2:]
+ if options.get("L_short", -1) == -1:
+ img, K = transform_img_and_K(
+ img,
+ (W, H),
+ K=K[None],
+ mode=(
+ options.get("transform_input", "crop")
+ if i in image_cond["input_indices"]
+ else options.get("transform_target", "crop")
+ ),
+ scale=(
+ 1.0
+ if i in image_cond["input_indices"]
+ else options.get("transform_scale", 1.0)
+ ),
+ )
+ else:
+ downsample = 3
+ assert options["L_short"] % (F * 2**downsample) == 0, (
+ "Short side of the image should be divisible by "
+ f"F*2**{downsample}={F * 2**downsample}."
+ )
+ img, K = transform_img_and_K(
+ img,
+ options["L_short"],
+ K=K[None],
+ size_stride=F * 2**downsample,
+ mode=(
+ options.get("transform_input", "crop")
+ if i in image_cond["input_indices"]
+ else options.get("transform_target", "crop")
+ ),
+ scale=(
+ 1.0
+ if i in image_cond["input_indices"]
+ else options.get("transform_scale", 1.0)
+ ),
+ )
+ version_dict["W"] = W = img.shape[-1]
+ version_dict["H"] = H = img.shape[-2]
+ K = K[0]
+ K[0] /= W
+ K[1] /= H
+ camera_cond["K"][i] = K
+ img_clip = img
+ elif isinstance(img, np.ndarray):
+ img_size = torch.Size(img.shape[:2])
+ img = torch.as_tensor(img).permute(2, 0, 1)
+ img = img.unsqueeze(0)
+ img = img / 255.0 * 2.0 - 1.0
+ if not gradio:
+ img, K = transform_img_and_K(img, (W, H), K=K[None])
+ assert K is not None
+ K = K[0]
+ K[0] /= W
+ K[1] /= H
+ camera_cond["K"][i] = K
+ img_clip = img
+ else:
+ assert (
+ False
+ ), f"Variable `img` got {type(img)} type which is not supported!!!"
+ imgs_clip.append(img_clip)
+ imgs.append(img)
+ imgs_clip = torch.cat(imgs_clip, dim=0)
+ imgs = torch.cat(imgs, dim=0)
+
+ if traj_prior_Ks is not None:
+ assert img_size is not None
+ for i, prior_k in enumerate(traj_prior_Ks):
+ img, prior_k = load_img_and_K(img_size, None, K=prior_k, device="cpu") # type: ignore
+ img, prior_k = transform_img_and_K(
+ img,
+ (W, H),
+ K=prior_k[None],
+ mode=options.get(
+ "transform_target", "crop"
+ ), # mode for prior is always same as target
+ scale=options.get(
+ "transform_scale", 1.0
+ ), # scale for prior is always same as target
+ )
+ prior_k = prior_k[0]
+ prior_k[0] /= W
+ prior_k[1] /= H
+ traj_prior_Ks[i] = prior_k
+
+ options["num_frames"] = T
+ discretization = denoiser.discretization
+ torch.cuda.empty_cache()
+
+ seed_everything(seed)
+
+ # Get Data
+ input_indices = image_cond["input_indices"]
+ input_imgs = imgs[input_indices]
+ input_imgs_clip = imgs_clip[input_indices]
+ input_c2ws = camera_cond["c2w"][input_indices]
+ input_Ks = camera_cond["K"][input_indices]
+
+ test_indices = [i for i in range(len(imgs)) if i not in input_indices]
+ test_imgs = imgs[test_indices]
+ test_imgs_clip = imgs_clip[test_indices]
+ test_c2ws = camera_cond["c2w"][test_indices]
+ test_Ks = camera_cond["K"][test_indices]
+
+ if options.get("save_input", True):
+ save_output(
+ {"/image": input_imgs},
+ save_path=os.path.join(save_path, "input"),
+ video_save_fps=2,
+ )
+
+ if not use_traj_prior:
+ chunk_strategy = options.get("chunk_strategy", "gt")
+
+ (
+ _,
+ input_inds_per_chunk,
+ input_sels_per_chunk,
+ test_inds_per_chunk,
+ test_sels_per_chunk,
+ ) = chunk_input_and_test(
+ T,
+ input_c2ws,
+ test_c2ws,
+ input_indices,
+ test_indices,
+ options=options,
+ task=task,
+ chunk_strategy=chunk_strategy,
+ gt_input_inds=list(range(input_c2ws.shape[0])),
+ )
+ print(
+ f"One pass - chunking with `{chunk_strategy}` strategy: total "
+ f"{len(input_inds_per_chunk)} forward(s) ..."
+ )
+
+ all_samples = {}
+ all_test_inds = []
+ for i, (
+ chunk_input_inds,
+ chunk_input_sels,
+ chunk_test_inds,
+ chunk_test_sels,
+ ) in tqdm(
+ enumerate(
+ zip(
+ input_inds_per_chunk,
+ input_sels_per_chunk,
+ test_inds_per_chunk,
+ test_sels_per_chunk,
+ )
+ ),
+ total=len(input_inds_per_chunk),
+ leave=False,
+ ):
+ (
+ curr_input_sels,
+ curr_test_sels,
+ curr_input_maps,
+ curr_test_maps,
+ ) = pad_indices(
+ chunk_input_sels,
+ chunk_test_sels,
+ T=T,
+ padding_mode=options.get("t_padding_mode", "last"),
+ )
+ curr_imgs, curr_imgs_clip, curr_c2ws, curr_Ks = [
+ assemble(
+ input=x[chunk_input_inds],
+ test=y[chunk_test_inds],
+ input_maps=curr_input_maps,
+ test_maps=curr_test_maps,
+ )
+ for x, y in zip(
+ [
+ torch.cat(
+ [
+ input_imgs,
+ get_k_from_dict(all_samples, "samples-rgb").to(
+ input_imgs.device
+ ),
+ ],
+ dim=0,
+ ),
+ torch.cat(
+ [
+ input_imgs_clip,
+ get_k_from_dict(all_samples, "samples-rgb").to(
+ input_imgs.device
+ ),
+ ],
+ dim=0,
+ ),
+ torch.cat([input_c2ws, test_c2ws[all_test_inds]], dim=0),
+ torch.cat([input_Ks, test_Ks[all_test_inds]], dim=0),
+ ], # procedurally append previously generated views to the input views
+ [test_imgs, test_imgs_clip, test_c2ws, test_Ks],
+ )
+ ]
+ value_dict = get_value_dict(
+ curr_imgs.to("cuda"),
+ curr_imgs_clip.to("cuda"),
+ curr_input_sels
+ + [
+ sel
+ for (ind, sel) in zip(
+ np.array(chunk_test_inds)[curr_test_maps[curr_test_maps != -1]],
+ curr_test_sels,
+ )
+ if test_indices[ind] in image_cond["input_indices"]
+ ],
+ curr_c2ws,
+ curr_Ks,
+ curr_input_sels
+ + [
+ sel
+ for (ind, sel) in zip(
+ np.array(chunk_test_inds)[curr_test_maps[curr_test_maps != -1]],
+ curr_test_sels,
+ )
+ if test_indices[ind] in camera_cond["input_indices"]
+ ],
+ all_c2ws=camera_cond["c2w"],
+ camera_scale=options.get("camera_scale", 2.0),
+ )
+ samplers = create_samplers(
+ options["guider_types"],
+ discretization,
+ [len(curr_imgs)],
+ options["num_steps"],
+ options["cfg_min"],
+ abort_event=abort_event,
+ )
+ assert len(samplers) == 1
+ samples = do_sample(
+ model,
+ ae,
+ conditioner,
+ denoiser,
+ samplers[0],
+ value_dict,
+ H,
+ W,
+ C,
+ F,
+ T=len(curr_imgs),
+ cfg=(
+ options["cfg"][0]
+ if isinstance(options["cfg"], (list, tuple))
+ else options["cfg"]
+ ),
+ **{k: options[k] for k in options if k not in ["cfg", "T"]},
+ )
+ samples = decode_output(
+ samples, len(curr_imgs), chunk_test_sels
+ ) # decode into dict
+ if options.get("save_first_pass", False):
+ save_output(
+ replace_or_include_input_for_dict(
+ samples,
+ chunk_test_sels,
+ curr_imgs,
+ curr_c2ws,
+ curr_Ks,
+ ),
+ save_path=os.path.join(save_path, "first-pass", f"forward_{i}"),
+ video_save_fps=2,
+ )
+ extend_dict(all_samples, samples)
+ all_test_inds.extend(chunk_test_inds)
+ else:
+ assert traj_prior_c2ws is not None, (
+ "`traj_prior_c2ws` should be set when using 2-pass sampling. One "
+ "potential reason is that the number of input frames is larger than "
+ "T. Set `num_prior_frames` manually to override the inferred stats."
+ )
+ traj_prior_c2ws = torch.as_tensor(
+ traj_prior_c2ws,
+ device=input_c2ws.device,
+ dtype=input_c2ws.dtype,
+ )
+
+ if traj_prior_Ks is None:
+ traj_prior_Ks = test_Ks[:1].repeat_interleave(
+ traj_prior_c2ws.shape[0], dim=0
+ )
+
+ traj_prior_imgs = imgs.new_zeros(traj_prior_c2ws.shape[0], *imgs.shape[1:])
+ traj_prior_imgs_clip = imgs_clip.new_zeros(
+ traj_prior_c2ws.shape[0], *imgs_clip.shape[1:]
+ )
+
+ # ---------------------------------- first pass ----------------------------------
+ T_first_pass = T[0] if isinstance(T, (list, tuple)) else T
+ T_second_pass = T[1] if isinstance(T, (list, tuple)) else T
+ chunk_strategy_first_pass = options.get(
+ "chunk_strategy_first_pass", "gt-nearest"
+ )
+ (
+ _,
+ input_inds_per_chunk,
+ input_sels_per_chunk,
+ prior_inds_per_chunk,
+ prior_sels_per_chunk,
+ ) = chunk_input_and_test(
+ T_first_pass,
+ input_c2ws,
+ traj_prior_c2ws,
+ input_indices,
+ image_cond["prior_indices"],
+ options=options,
+ task=task,
+ chunk_strategy=chunk_strategy_first_pass,
+ gt_input_inds=list(range(input_c2ws.shape[0])),
+ )
+ print(
+ f"Two passes (first) - chunking with `{chunk_strategy_first_pass}` strategy: total "
+ f"{len(input_inds_per_chunk)} forward(s) ..."
+ )
+
+ all_samples = {}
+ all_prior_inds = []
+ for i, (
+ chunk_input_inds,
+ chunk_input_sels,
+ chunk_prior_inds,
+ chunk_prior_sels,
+ ) in tqdm(
+ enumerate(
+ zip(
+ input_inds_per_chunk,
+ input_sels_per_chunk,
+ prior_inds_per_chunk,
+ prior_sels_per_chunk,
+ )
+ ),
+ total=len(input_inds_per_chunk),
+ leave=False,
+ ):
+ (
+ curr_input_sels,
+ curr_prior_sels,
+ curr_input_maps,
+ curr_prior_maps,
+ ) = pad_indices(
+ chunk_input_sels,
+ chunk_prior_sels,
+ T=T_first_pass,
+ padding_mode=options.get("t_padding_mode", "last"),
+ )
+ curr_imgs, curr_imgs_clip, curr_c2ws, curr_Ks = [
+ assemble(
+ input=x[chunk_input_inds],
+ test=y[chunk_prior_inds],
+ input_maps=curr_input_maps,
+ test_maps=curr_prior_maps,
+ )
+ for x, y in zip(
+ [
+ torch.cat(
+ [
+ input_imgs,
+ get_k_from_dict(all_samples, "samples-rgb").to(
+ input_imgs.device
+ ),
+ ],
+ dim=0,
+ ),
+ torch.cat(
+ [
+ input_imgs_clip,
+ get_k_from_dict(all_samples, "samples-rgb").to(
+ input_imgs.device
+ ),
+ ],
+ dim=0,
+ ),
+ torch.cat([input_c2ws, traj_prior_c2ws[all_prior_inds]], dim=0),
+ torch.cat([input_Ks, traj_prior_Ks[all_prior_inds]], dim=0),
+ ], # procedurally append generated prior views to the input views
+ [
+ traj_prior_imgs,
+ traj_prior_imgs_clip,
+ traj_prior_c2ws,
+ traj_prior_Ks,
+ ],
+ )
+ ]
+ value_dict = get_value_dict(
+ curr_imgs.to("cuda"),
+ curr_imgs_clip.to("cuda"),
+ curr_input_sels,
+ curr_c2ws,
+ curr_Ks,
+ list(range(T_first_pass)),
+ all_c2ws=camera_cond["c2w"],
+ camera_scale=options.get("camera_scale", 2.0),
+ )
+ samplers = create_samplers(
+ options["guider_types"],
+ discretization,
+ [T_first_pass, T_second_pass],
+ options["num_steps"],
+ options["cfg_min"],
+ abort_event=abort_event,
+ )
+ samples = do_sample(
+ model,
+ ae,
+ conditioner,
+ denoiser,
+ (
+ samplers[1]
+ if len(samplers) > 1
+ and options.get("ltr_first_pass", False)
+ and chunk_strategy_first_pass != "gt"
+ and i > 0
+ else samplers[0]
+ ),
+ value_dict,
+ H,
+ W,
+ C,
+ F,
+ cfg=(
+ options["cfg"][0]
+ if isinstance(options["cfg"], (list, tuple))
+ else options["cfg"]
+ ),
+ T=T_first_pass,
+ global_pbar=first_pass_pbar,
+ **{k: options[k] for k in options if k not in ["cfg", "T", "sampler"]},
+ )
+ if samples is None:
+ return
+ samples = decode_output(
+ samples, T_first_pass, chunk_prior_sels
+ ) # decode into dict
+ extend_dict(all_samples, samples)
+ all_prior_inds.extend(chunk_prior_inds)
+
+ if options.get("save_first_pass", True):
+ save_output(
+ all_samples,
+ save_path=os.path.join(save_path, "first-pass"),
+ video_save_fps=5,
+ )
+ video_path_0 = os.path.join(save_path, "first-pass", "samples-rgb.mp4")
+ yield video_path_0
+
+ # ---------------------------------- second pass ----------------------------------
+ prior_indices = image_cond["prior_indices"]
+ assert (
+ prior_indices is not None
+ ), "`prior_frame_indices` needs to be set if using 2-pass sampling."
+ prior_argsort = np.argsort(input_indices + prior_indices).tolist()
+ prior_indices = np.array(input_indices + prior_indices)[prior_argsort].tolist()
+ gt_input_inds = [prior_argsort.index(i) for i in range(input_c2ws.shape[0])]
+
+ traj_prior_imgs = torch.cat(
+ [input_imgs, get_k_from_dict(all_samples, "samples-rgb")], dim=0
+ )[prior_argsort]
+ traj_prior_imgs_clip = torch.cat(
+ [
+ input_imgs_clip,
+ get_k_from_dict(all_samples, "samples-rgb"),
+ ],
+ dim=0,
+ )[prior_argsort]
+ traj_prior_c2ws = torch.cat([input_c2ws, traj_prior_c2ws], dim=0)[prior_argsort]
+ traj_prior_Ks = torch.cat([input_Ks, traj_prior_Ks], dim=0)[prior_argsort]
+
+ update_kv_for_dict(all_samples, "samples-rgb", traj_prior_imgs)
+ update_kv_for_dict(all_samples, "samples-c2ws", traj_prior_c2ws)
+ update_kv_for_dict(all_samples, "samples-intrinsics", traj_prior_Ks)
+
+ chunk_strategy = options.get("chunk_strategy", "nearest")
+ (
+ _,
+ prior_inds_per_chunk,
+ prior_sels_per_chunk,
+ test_inds_per_chunk,
+ test_sels_per_chunk,
+ ) = chunk_input_and_test(
+ T_second_pass,
+ traj_prior_c2ws,
+ test_c2ws,
+ prior_indices,
+ test_indices,
+ options=options,
+ task=task,
+ chunk_strategy=chunk_strategy,
+ gt_input_inds=gt_input_inds,
+ )
+ print(
+ f"Two passes (second) - chunking with `{chunk_strategy}` strategy: total "
+ f"{len(prior_inds_per_chunk)} forward(s) ..."
+ )
+
+ all_samples = {}
+ all_test_inds = []
+ for i, (
+ chunk_prior_inds,
+ chunk_prior_sels,
+ chunk_test_inds,
+ chunk_test_sels,
+ ) in tqdm(
+ enumerate(
+ zip(
+ prior_inds_per_chunk,
+ prior_sels_per_chunk,
+ test_inds_per_chunk,
+ test_sels_per_chunk,
+ )
+ ),
+ total=len(prior_inds_per_chunk),
+ leave=False,
+ ):
+ (
+ curr_prior_sels,
+ curr_test_sels,
+ curr_prior_maps,
+ curr_test_maps,
+ ) = pad_indices(
+ chunk_prior_sels,
+ chunk_test_sels,
+ T=T_second_pass,
+ padding_mode="last",
+ )
+ curr_imgs, curr_imgs_clip, curr_c2ws, curr_Ks = [
+ assemble(
+ input=x[chunk_prior_inds],
+ test=y[chunk_test_inds],
+ input_maps=curr_prior_maps,
+ test_maps=curr_test_maps,
+ )
+ for x, y in zip(
+ [
+ traj_prior_imgs,
+ traj_prior_imgs_clip,
+ traj_prior_c2ws,
+ traj_prior_Ks,
+ ],
+ [test_imgs, test_imgs_clip, test_c2ws, test_Ks],
+ )
+ ]
+ value_dict = get_value_dict(
+ curr_imgs.to("cuda"),
+ curr_imgs_clip.to("cuda"),
+ curr_prior_sels,
+ curr_c2ws,
+ curr_Ks,
+ list(range(T_second_pass)),
+ all_c2ws=camera_cond["c2w"],
+ camera_scale=options.get("camera_scale", 2.0),
+ )
+ samples = do_sample(
+ model,
+ ae,
+ conditioner,
+ denoiser,
+ samplers[1] if len(samplers) > 1 else samplers[0],
+ value_dict,
+ H,
+ W,
+ C,
+ F,
+ T=T_second_pass,
+ cfg=(
+ options["cfg"][1]
+ if isinstance(options["cfg"], (list, tuple))
+ and len(options["cfg"]) > 1
+ else options["cfg"]
+ ),
+ global_pbar=second_pass_pbar,
+ **{k: options[k] for k in options if k not in ["cfg", "T", "sampler"]},
+ )
+ if samples is None:
+ return
+ samples = decode_output(
+ samples, T_second_pass, chunk_test_sels
+ ) # decode into dict
+ if options.get("save_second_pass", False):
+ save_output(
+ replace_or_include_input_for_dict(
+ samples,
+ chunk_test_sels,
+ curr_imgs,
+ curr_c2ws,
+ curr_Ks,
+ ),
+ save_path=os.path.join(save_path, "second-pass", f"forward_{i}"),
+ video_save_fps=2,
+ )
+ extend_dict(all_samples, samples)
+ all_test_inds.extend(chunk_test_inds)
+ all_samples = {
+ key: value[np.argsort(all_test_inds)] for key, value in all_samples.items()
+ }
+ save_output(
+ replace_or_include_input_for_dict(
+ all_samples,
+ test_indices,
+ imgs.clone(),
+ camera_cond["c2w"].clone(),
+ camera_cond["K"].clone(),
+ )
+ if options.get("replace_or_include_input", False)
+ else all_samples,
+ save_path=save_path,
+ video_save_fps=options.get("video_save_fps", 2),
+ )
+ video_path_1 = os.path.join(save_path, "samples-rgb.mp4")
+ yield video_path_1
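
Review note: before the second pass, `run_one_scene` merges the input frames with the generated prior frames, sorts them by frame index, and records where the original inputs ended up (`gt_input_inds`). The sketch below reproduces that bookkeeping with plain lists; `merge_input_and_prior` is a hypothetical name used only for illustration, and `sorted` stands in for `np.argsort` (equivalent here because frame indices are assumed to be distinct).

```python
def merge_input_and_prior(input_indices, prior_indices):
    """Sort the union of input and prior frame indices.

    Returns the sorting permutation, the sorted frame indices, and the
    positions of the original input frames inside the sorted list.
    """
    combined = input_indices + prior_indices
    order = sorted(range(len(combined)), key=combined.__getitem__)
    merged = [combined[i] for i in order]
    gt_input_inds = [order.index(i) for i in range(len(input_indices))]
    return order, merged, gt_input_inds


# Inputs are frames 0 and 10; the first pass generated frames 3 and 7.
order, merged, gt_input_inds = merge_input_and_prior([0, 10], [3, 7])
```

Here the two input frames land at positions 0 and 3 of the merged sequence, which is what lets the second pass tell real inputs from generated priors.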
diff --git a/seva/geometry.py b/seva/geometry.py
new file mode 100644
index 0000000000000000000000000000000000000000..0065447137cd1d64ee21f2235570df9b2c57ec78
--- /dev/null
+++ b/seva/geometry.py
@@ -0,0 +1,811 @@
+from typing import Literal
+
+import numpy as np
+import roma
+import scipy.interpolate
+import torch
+import torch.nn.functional as F
+
+DEFAULT_FOV_RAD = 0.9424777960769379 # 54 degrees by default
+
+
+def get_camera_dist(
+ source_c2ws: torch.Tensor, # N x 3 x 4
+ target_c2ws: torch.Tensor, # M x 3 x 4
+ mode: str = "translation",
+):
+ if mode == "rotation":
+ dists = torch.acos(
+ (
+ (
+ torch.matmul(
+ source_c2ws[:, None, :3, :3],
+ target_c2ws[None, :, :3, :3].transpose(-1, -2),
+ )
+ .diagonal(offset=0, dim1=-2, dim2=-1)
+ .sum(-1)
+ - 1
+ )
+ / 2
+ ).clamp(-1, 1)
+ ) * (180 / torch.pi)
+ elif mode == "translation":
+ dists = torch.norm(
+ source_c2ws[:, None, :3, 3] - target_c2ws[None, :, :3, 3], dim=-1
+ )
+ else:
+ raise NotImplementedError(
+ f"Mode {mode} is not implemented for finding nearest source indices."
+ )
+ return dists
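
Review note: the `rotation` mode of `get_camera_dist` computes the geodesic angle between two rotations, `acos((trace(R_a R_b^T) - 1) / 2)`, in degrees. The following is a minimal stdlib-only sketch of that formula for a single pair of matrices; `rotation_angle_deg` and `rot_z` are hypothetical helpers, and nested lists replace the batched tensors used in the diff.

```python
import math


def rot_z(deg):
    """3x3 rotation about the z-axis by `deg` degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_angle_deg(R_a, R_b):
    """Geodesic angle between two rotation matrices, in degrees."""
    # trace(R_a @ R_b^T) equals the sum of element-wise products.
    trace = sum(R_a[i][j] * R_b[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


angle = rotation_angle_deg(rot_z(30.0), rot_z(75.0))
```

The clamp mirrors the `.clamp(-1, 1)` in the diff; without it, identical rotations can yield a cosine marginally above 1 and a NaN angle.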
+
+
+def to_hom(X):
+ # get homogeneous coordinates of the input
+ X_hom = torch.cat([X, torch.ones_like(X[..., :1])], dim=-1)
+ return X_hom
+
+
+def to_hom_pose(pose):
+ # get homogeneous coordinates of the input pose
+ if pose.shape[-2:] == (3, 4):
+ pose_hom = torch.eye(4, device=pose.device)[None].repeat(pose.shape[0], 1, 1)
+ pose_hom[:, :3, :] = pose
+ return pose_hom
+ return pose
+
+
+def get_default_intrinsics(
+ fov_rad=DEFAULT_FOV_RAD,
+ aspect_ratio=1.0,
+):
+ if not isinstance(fov_rad, torch.Tensor):
+ fov_rad = torch.tensor(
+ [fov_rad] if isinstance(fov_rad, (int, float)) else fov_rad
+ )
+ if aspect_ratio >= 1.0: # W >= H
+ focal_x = 0.5 / torch.tan(0.5 * fov_rad)
+ focal_y = focal_x * aspect_ratio
+ else: # W < H
+ focal_y = 0.5 / torch.tan(0.5 * fov_rad)
+ focal_x = focal_y / aspect_ratio
+ intrinsics = focal_x.new_zeros((focal_x.shape[0], 3, 3))
+ intrinsics[:, torch.eye(3, device=focal_x.device, dtype=bool)] = torch.stack(
+ [focal_x, focal_y, torch.ones_like(focal_x)], dim=-1
+ )
+ intrinsics[:, :, -1] = torch.tensor(
+ [0.5, 0.5, 1.0], device=focal_x.device, dtype=focal_x.dtype
+ )
+ return intrinsics
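
Review note: `get_default_intrinsics` builds a normalized pinhole matrix in which the field of view applies to the longer image side and the principal point sits at (0.5, 0.5). A single-camera sketch under those assumptions, using nested lists instead of tensors; `default_intrinsics` is a hypothetical name and `aspect_ratio` is taken to mean width over height, as the `W >= H` comment in the diff suggests.

```python
import math

DEFAULT_FOV_RAD = math.radians(54.0)  # 54 degrees, matching the diff's default


def default_intrinsics(fov_rad=DEFAULT_FOV_RAD, aspect_ratio=1.0):
    """Normalized 3x3 intrinsics; focal lengths are in units of image size."""
    if aspect_ratio >= 1.0:  # W >= H: the FOV spans the width
        focal_x = 0.5 / math.tan(0.5 * fov_rad)
        focal_y = focal_x * aspect_ratio
    else:  # W < H: the FOV spans the height
        focal_y = 0.5 / math.tan(0.5 * fov_rad)
        focal_x = focal_y / aspect_ratio
    return [
        [focal_x, 0.0, 0.5],
        [0.0, focal_y, 0.5],
        [0.0, 0.0, 1.0],
    ]


K_square = default_intrinsics()
K_wide = default_intrinsics(aspect_ratio=2.0)
```

Multiplying the first row by the width and the second row by the height converts such a matrix to pixel units, which is what `get_plucker_coordinates` does with `target_size`.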
+
+
+def get_image_grid(img_h, img_w):
+ # Adding 0.5 (sampling at pixel centers) is very important, especially when
+ # img_h and img_w are small (e.g., 72).
+ y_range = torch.arange(img_h, dtype=torch.float32).add_(0.5)
+ x_range = torch.arange(img_w, dtype=torch.float32).add_(0.5)
+ Y, X = torch.meshgrid(y_range, x_range, indexing="ij") # [H,W]
+ xy_grid = torch.stack([X, Y], dim=-1).view(-1, 2) # [HW,2]
+ return to_hom(xy_grid) # [HW,3]
+
+
+def img2cam(X, cam_intr):
+ return X @ cam_intr.inverse().transpose(-1, -2)
+
+
+def cam2world(X, pose):
+ X_hom = to_hom(X)
+ pose_inv = torch.linalg.inv(to_hom_pose(pose))[..., :3, :4]
+ return X_hom @ pose_inv.transpose(-1, -2)
+
+
+def get_center_and_ray(
+ img_h, img_w, pose, intr, zero_center_for_debugging=False
+): # [HW,2]
+ # Given the intrinsic/extrinsic matrices, get the camera center and ray directions.
+ # assert(opt.camera.model=="perspective")
+
+ # compute center and ray
+ grid_img = get_image_grid(img_h, img_w) # [HW,3]
+ grid_3D_cam = img2cam(grid_img.to(intr.device), intr.float()) # [B,HW,3]
+ center_3D_cam = torch.zeros_like(grid_3D_cam) # [B,HW,3]
+
+ # transform from camera to world coordinates
+ grid_3D = cam2world(grid_3D_cam, pose) # [B,HW,3]
+ center_3D = cam2world(center_3D_cam, pose) # [B,HW,3]
+ ray = grid_3D - center_3D # [B,HW,3]
+
+ return center_3D_cam if zero_center_for_debugging else center_3D, ray, grid_3D_cam
+
+
+def get_plucker_coordinates(
+ extrinsics_src,
+ extrinsics,
+ intrinsics=None,
+ fov_rad=DEFAULT_FOV_RAD,
+ mode="plucker",
+ rel_zero_translation=True,
+ zero_center_for_debugging=False,
+ target_size=[72, 72], # 576-size image
+ return_grid_cam=False, # also return the camera-space grid for later reuse
+):
+ if intrinsics is None:
+ intrinsics = get_default_intrinsics(fov_rad).to(extrinsics.device)
+ else:
+ # for some data preprocessed in the early stage (e.g., MVI and CO3D),
+ # intrinsics are expressed in raw pixel space (e.g., 576x576) instead
+ # of normalized image coordinates
+ if not (
+ torch.all(intrinsics[:, :2, -1] >= 0)
+ and torch.all(intrinsics[:, :2, -1] <= 1)
+ ):
+ intrinsics[:, :2] /= intrinsics.new_tensor(target_size).view(1, -1, 1) * 8
+ # Intrinsics must be expressed in resolution-independent normalized image
+ # coordinates. Only a very simple check is performed here: the principal
+ # points must lie between 0 and 1.
+ assert (
+ torch.all(intrinsics[:, :2, -1] >= 0)
+ and torch.all(intrinsics[:, :2, -1] <= 1)
+ ), "Intrinsics should be expressed in resolution-independent normalized image coordinates."
+
+ c2w_src = torch.linalg.inv(extrinsics_src)
+ if not rel_zero_translation:
+ c2w_src[:3, 3] = c2w_src[3, :3] = 0.0
+ # transform coordinates from the source camera's coordinate system to the coordinate system of the respective camera
+ extrinsics_rel = torch.einsum(
+ "vnm,vmp->vnp", extrinsics, c2w_src[None].repeat(extrinsics.shape[0], 1, 1)
+ )
+
+ intrinsics[:, :2] *= extrinsics.new_tensor(
+ [
+ target_size[1], # w
+ target_size[0], # h
+ ]
+ ).view(1, -1, 1)
+ centers, rays, grid_cam = get_center_and_ray(
+ img_h=target_size[0],
+ img_w=target_size[1],
+ pose=extrinsics_rel[:, :3, :],
+ intr=intrinsics,
+ zero_center_for_debugging=zero_center_for_debugging,
+ )
+
+ if mode == "plucker" or "v1" in mode:
+ rays = torch.nn.functional.normalize(rays, dim=-1)
+ plucker = torch.cat((rays, torch.cross(centers, rays, dim=-1)), dim=-1)
+ else:
+ raise ValueError(f"Unknown Plucker coordinate mode: {mode}")
+
+ plucker = plucker.permute(0, 2, 1).reshape(plucker.shape[0], -1, *target_size)
+ if return_grid_cam:
+ return plucker, grid_cam.reshape(-1, *target_size, 3)
+ return plucker
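
Review note: each pixel ray is encoded above as six numbers, the unit direction `d` followed by the moment `o x d`, where `o` is the camera center. The sketch below shows that encoding for one ray in plain Python; `plucker_from_ray` is a hypothetical helper, and the ordering (direction first, then moment) follows the `torch.cat((rays, torch.cross(centers, rays)))` call in the diff.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def plucker_from_ray(origin, direction):
    """Return [d_x, d_y, d_z, m_x, m_y, m_z] with d unit length and m = o x d."""
    norm = math.sqrt(sum(x * x for x in direction))
    d = [x / norm for x in direction]
    return d + cross(origin, d)


ray = plucker_from_ray([1.0, 0.0, 0.0], [0.0, 0.0, 2.0])
# Any other point on the same ray gives the same coordinates.
shifted = plucker_from_ray([1.0, 0.0, 3.0], [0.0, 0.0, 2.0])
```

The moment does not change when the origin slides along the ray, so the encoding identifies the line itself rather than a particular point on it.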
+
+
+def rt_to_mat4(
+ R: torch.Tensor, t: torch.Tensor, s: torch.Tensor | None = None
+) -> torch.Tensor:
+ """
+ Args:
+ R (torch.Tensor): (..., 3, 3).
+ t (torch.Tensor): (..., 3).
+ s (torch.Tensor): (...,).
+
+ Returns:
+ torch.Tensor: (..., 4, 4)
+ """
+ mat34 = torch.cat([R, t[..., None]], dim=-1)
+ if s is None:
+ bottom = (
+ mat34.new_tensor([[0.0, 0.0, 0.0, 1.0]])
+ .reshape((1,) * (mat34.dim() - 2) + (1, 4))
+ .expand(mat34.shape[:-2] + (1, 4))
+ )
+ else:
+ bottom = F.pad(1.0 / s[..., None, None], (3, 0), value=0.0)
+ mat4 = torch.cat([mat34, bottom], dim=-2)
+ return mat4
+
+
+def get_preset_pose_fov(
+ option: Literal[
+ "orbit",
+ "spiral",
+ "lemniscate",
+ "zoom-in",
+ "zoom-out",
+ "dolly zoom-in",
+ "dolly zoom-out",
+ "move-forward",
+ "move-backward",
+ "move-up",
+ "move-down",
+ "move-left",
+ "move-right",
+ "roll",
+ ],
+ num_frames: int,
+ start_w2c: torch.Tensor,
+ look_at: torch.Tensor,
+ up_direction: torch.Tensor | None = None,
+ fov: float = DEFAULT_FOV_RAD,
+ spiral_radii: list[float] = [0.5, 0.5, 0.2],
+ zoom_factor: float | None = None,
+):
+ poses = fovs = None
+ if option == "orbit":
+ poses = torch.linalg.inv(
+ get_arc_horizontal_w2cs(
+ start_w2c,
+ look_at,
+ up_direction,
+ num_frames=num_frames,
+ endpoint=False,
+ )
+ ).numpy()
+ fovs = np.full((num_frames,), fov)
+ elif option == "spiral":
+ poses = generate_spiral_path(
+ torch.linalg.inv(start_w2c)[None].numpy() @ np.diagflat([1, -1, -1, 1]),
+ np.array([1, 5]),
+ n_frames=num_frames,
+ n_rots=2,
+ zrate=0.5,
+ radii=spiral_radii,
+ endpoint=False,
+ ) @ np.diagflat([1, -1, -1, 1])
+ poses = np.concatenate(
+ [
+ poses,
+ np.array([0.0, 0.0, 0.0, 1.0])[None, None].repeat(len(poses), 0),
+ ],
+ 1,
+ )
+ # We want the spiral trajectory to always start from start_w2c. Thus we
+ # apply the relative pose to get the final trajectory.
+ poses = (
+ np.linalg.inv(start_w2c.numpy())[None] @ np.linalg.inv(poses[:1]) @ poses
+ )
+ fovs = np.full((num_frames,), fov)
+ elif option == "lemniscate":
+ poses = torch.linalg.inv(
+ get_lemniscate_w2cs(
+ start_w2c,
+ look_at,
+ up_direction,
+ num_frames,
+ degree=60.0,
+ endpoint=False,
+ )
+ ).numpy()
+ fovs = np.full((num_frames,), fov)
+ elif option == "roll":
+ poses = torch.linalg.inv(
+ get_roll_w2cs(
+ start_w2c,
+ look_at,
+ None,
+ num_frames,
+ degree=360.0,
+ endpoint=False,
+ )
+ ).numpy()
+ fovs = np.full((num_frames,), fov)
+ elif option in [
+ "dolly zoom-in",
+ "dolly zoom-out",
+ "zoom-in",
+ "zoom-out",
+ ]:
+ if option.startswith("dolly"):
+ direction = "backward" if option == "dolly zoom-in" else "forward"
+ poses = torch.linalg.inv(
+ get_moving_w2cs(
+ start_w2c,
+ look_at,
+ up_direction,
+ num_frames,
+ endpoint=True,
+ direction=direction,
+ )
+ ).numpy()
+ else:
+ poses = torch.linalg.inv(start_w2c)[None].repeat(num_frames, 1, 1).numpy()
+ fov_rad_start = fov
+ if zoom_factor is None:
+ zoom_factor = 0.28 if option.endswith("zoom-in") else 1.5
+ fov_rad_end = zoom_factor * fov
+ fovs = (
+ np.linspace(0, 1, num_frames) * (fov_rad_end - fov_rad_start)
+ + fov_rad_start
+ )
+ elif option in [
+ "move-forward",
+ "move-backward",
+ "move-up",
+ "move-down",
+ "move-left",
+ "move-right",
+ ]:
+ poses = torch.linalg.inv(
+ get_moving_w2cs(
+ start_w2c,
+ look_at,
+ up_direction,
+ num_frames,
+ endpoint=True,
+ direction=option.removeprefix("move-"),
+ )
+ ).numpy()
+ fovs = np.full((num_frames,), fov)
+ else:
+ raise ValueError(f"Unknown preset option {option}.")
+
+ return poses, fovs
+
+
+def get_lookat(origins: torch.Tensor, viewdirs: torch.Tensor) -> torch.Tensor:
+ """Triangulate a set of rays to find a single lookat point.
+
+ Args:
+ origins (torch.Tensor): A (N, 3) array of ray origins.
+ viewdirs (torch.Tensor): A (N, 3) array of ray view directions.
+
+ Returns:
+ torch.Tensor: A (3,) lookat point.
+ """
+
+ viewdirs = torch.nn.functional.normalize(viewdirs, dim=-1)
+ eye = torch.eye(3, device=origins.device, dtype=origins.dtype)[None]
+ # Calculate projection matrix I - rr^T
+ I_min_cov = eye - (viewdirs[..., None] * viewdirs[..., None, :])
+ # Compute sum of projections
+ sum_proj = I_min_cov.matmul(origins[..., None]).sum(dim=-3)
+ # Solve for the intersection point using least squares
+ lookat = torch.linalg.lstsq(I_min_cov.sum(dim=-3), sum_proj).solution[..., 0]
+ # Check NaNs.
+ assert not torch.any(torch.isnan(lookat))
+ return lookat
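
Review note: `get_lookat` finds the point closest to all rays by solving `sum_i (I - d_i d_i^T) p = sum_i (I - d_i d_i^T) o_i`. The sketch below works through the same normal equations without any dependencies. It is an illustration only: `triangulate_lookat` and `det3` are hypothetical names, and Cramer's rule replaces `torch.linalg.lstsq`, so unlike the diff it raises on (nearly) parallel rays instead of returning a least-squares answer.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


def triangulate_lookat(origins, viewdirs):
    """Point minimizing the summed squared distance to all rays."""
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, viewdirs):
        norm = math.sqrt(sum(x * x for x in d))
        d = [x / norm for x in d]
        for i in range(3):
            for j in range(3):
                # Entry (i, j) of the projector I - d d^T.
                p_ij = (1.0 if i == j else 0.0) - d[i] * d[j]
                A[i][j] += p_ij
                b[i] += p_ij * o[j]
    det_A = det3(A)
    if abs(det_A) < 1e-12:
        raise ValueError("rays are (nearly) parallel; no unique look-at point")
    solution = []
    for col in range(3):
        A_col = [row[:] for row in A]
        for i in range(3):
            A_col[i][col] = b[i]
        solution.append(det3(A_col) / det_A)
    return solution


# Two cameras on the x-axis, both looking at the point (0, 0, 1).
lookat = triangulate_lookat(
    origins=[[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    viewdirs=[[1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]],
)
```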
+
+
+def get_lookat_w2cs(
+ positions: torch.Tensor,
+ lookat: torch.Tensor,
+ up: torch.Tensor,
+ face_off: bool = False,
+):
+ """
+ Args:
+ positions: (N, 3) tensor of camera positions
+ lookat: (3,) tensor of lookat point
+ up: (3,) or (N, 3) tensor of up vector
+
+ Returns:
+ w2cs: (N, 4, 4) tensor of world-to-camera matrices
+ """
+ forward_vectors = F.normalize(lookat - positions, dim=-1)
+ if face_off:
+ forward_vectors = -forward_vectors
+ if up.dim() == 1:
+ up = up[None]
+ right_vectors = F.normalize(torch.cross(forward_vectors, up, dim=-1), dim=-1)
+ down_vectors = F.normalize(
+ torch.cross(forward_vectors, right_vectors, dim=-1), dim=-1
+ )
+ Rs = torch.stack([right_vectors, down_vectors, forward_vectors], dim=-1)
+ w2cs = torch.linalg.inv(rt_to_mat4(Rs, positions))
+ return w2cs
+
+
+def get_arc_horizontal_w2cs(
+ ref_w2c: torch.Tensor,
+ lookat: torch.Tensor,
+ up: torch.Tensor | None,
+ num_frames: int,
+ clockwise: bool = True,
+ face_off: bool = False,
+ endpoint: bool = False,
+ degree: float = 360.0,
+ ref_up_shift: float = 0.0,
+ ref_radius_scale: float = 1.0,
+ **_,
+) -> torch.Tensor:
+ ref_c2w = torch.linalg.inv(ref_w2c)
+ ref_position = ref_c2w[:3, 3]
+ if up is None:
+ up = -ref_c2w[:3, 1]
+ assert up is not None
+ ref_position += up * ref_up_shift
+ ref_position *= ref_radius_scale
+ thetas = (
+ torch.linspace(0.0, torch.pi * degree / 180, num_frames, device=ref_w2c.device)
+ if endpoint
+ else torch.linspace(
+ 0.0, torch.pi * degree / 180, num_frames + 1, device=ref_w2c.device
+ )[:-1]
+ )
+ if not clockwise:
+ thetas = -thetas
+ positions = (
+ torch.einsum(
+ "nij,j->ni",
+ roma.rotvec_to_rotmat(thetas[:, None] * up[None]),
+ ref_position - lookat,
+ )
+ + lookat
+ )
+ return get_lookat_w2cs(positions, lookat, up, face_off=face_off)
+
+
+def get_lemniscate_w2cs(
+ ref_w2c: torch.Tensor,
+ lookat: torch.Tensor,
+ up: torch.Tensor | None,
+ num_frames: int,
+ degree: float,
+ endpoint: bool = False,
+ **_,
+) -> torch.Tensor:
+ ref_c2w = torch.linalg.inv(ref_w2c)
+ a = torch.linalg.norm(ref_c2w[:3, 3] - lookat) * np.tan(degree / 360 * np.pi)
+ # Lemniscate curve in camera space. Starting at the origin.
+ thetas = (
+ torch.linspace(0, 2 * torch.pi, num_frames, device=ref_w2c.device)
+ if endpoint
+ else torch.linspace(0, 2 * torch.pi, num_frames + 1, device=ref_w2c.device)[:-1]
+ ) + torch.pi / 2
+ positions = torch.stack(
+ [
+ a * torch.cos(thetas) / (1 + torch.sin(thetas) ** 2),
+ a * torch.cos(thetas) * torch.sin(thetas) / (1 + torch.sin(thetas) ** 2),
+ torch.zeros(num_frames, device=ref_w2c.device),
+ ],
+ dim=-1,
+ )
+ # Transform to world space.
+ positions = torch.einsum(
+ "ij,nj->ni", ref_c2w[:3], F.pad(positions, (0, 1), value=1.0)
+ )
+ if up is None:
+ up = -ref_c2w[:3, 1]
+ assert up is not None
+ return get_lookat_w2cs(positions, lookat, up)
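
Review note: the camera-space offsets in `get_lemniscate_w2cs` trace a figure-eight (lemniscate of Bernoulli) whose parameter is shifted by pi/2 so that the path starts at the origin. A stdlib-only sketch of that sampling; `lemniscate_offsets` is a hypothetical helper, and `a` is passed in directly rather than derived from the camera-to-look-at distance as in the diff.

```python
import math


def lemniscate_offsets(num_frames, a=1.0, endpoint=False):
    """Sample (x, y) offsets along a lemniscate, starting at the origin."""
    # endpoint=False mimics linspace(0, 2*pi, num_frames + 1)[:-1].
    denom = max(num_frames - 1, 1) if endpoint else num_frames
    points = []
    for i in range(num_frames):
        t = 2.0 * math.pi * i / denom + math.pi / 2.0
        scale = a / (1.0 + math.sin(t) ** 2)
        points.append((scale * math.cos(t), scale * math.cos(t) * math.sin(t)))
    return points


offsets = lemniscate_offsets(num_frames=8, a=1.0)
```

With `endpoint=False` the last sample stops one step short of the start, so looping the resulting video does not repeat the first frame.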
+
+
+def get_moving_w2cs(
+ ref_w2c: torch.Tensor,
+ lookat: torch.Tensor,
+ up: torch.Tensor | None,
+ num_frames: int,
+ endpoint: bool = False,
+ direction: str = "forward",
+ tilt_xy: torch.Tensor | None = None,
+):
+ """
+ Args:
+ ref_w2c: (4, 4) tensor of the reference world-to-camera matrix
+ lookat: (3,) tensor of lookat point
+ up: (3,) tensor of up vector
+
+ Returns:
+ w2cs: (N, 4, 4) tensor of world-to-camera matrices
+ """
+ ref_c2w = torch.linalg.inv(ref_w2c)
+ ref_position = ref_c2w[:3, -1]
+ if up is None:
+ up = -ref_c2w[:3, 1]
+
+ direction_vectors = {
+ "forward": (lookat - ref_position).clone(),
+ "backward": -(lookat - ref_position).clone(),
+ "up": up.clone(),
+ "down": -up.clone(),
+ "right": torch.cross((lookat - ref_position), up, dim=0),
+ "left": -torch.cross((lookat - ref_position), up, dim=0),
+ }
+ if direction not in direction_vectors:
+ raise ValueError(
+ f"Invalid direction: {direction}. Must be one of {list(direction_vectors.keys())}"
+ )
+
+ positions = ref_position + (
+ F.normalize(direction_vectors[direction], dim=0)
+ * (
+ torch.linspace(0, 0.99, num_frames, device=ref_w2c.device)
+ if endpoint
+ else torch.linspace(0, 1, num_frames + 1, device=ref_w2c.device)[:-1]
+ )[:, None]
+ )
+
+ if tilt_xy is not None:
+ positions[:, :2] += tilt_xy
+
+ return get_lookat_w2cs(positions, lookat, up)
+
+
+def get_roll_w2cs(
+ ref_w2c: torch.Tensor,
+ lookat: torch.Tensor,
+ up: torch.Tensor | None,
+ num_frames: int,
+ endpoint: bool = False,
+ degree: float = 360.0,
+ **_,
+) -> torch.Tensor:
+ ref_c2w = torch.linalg.inv(ref_w2c)
+ ref_position = ref_c2w[:3, 3]
+ if up is None:
+ up = -ref_c2w[:3, 1] # Infer the up vector from the reference.
+
+ # Create roll angles
+ thetas = (
+ torch.linspace(0.0, torch.pi * degree / 180, num_frames, device=ref_w2c.device)
+ if endpoint
+ else torch.linspace(
+ 0.0, torch.pi * degree / 180, num_frames + 1, device=ref_w2c.device
+ )[:-1]
+ )[:, None]
+
+ lookat_vector = F.normalize(lookat[None].float(), dim=-1)
+ up = up[None]
+ up = (
+ up * torch.cos(thetas)
+ + torch.cross(lookat_vector, up, dim=-1) * torch.sin(thetas)
+ + lookat_vector
+ * torch.einsum("ij,ij->i", lookat_vector, up)[:, None]
+ * (1 - torch.cos(thetas))
+ )
+
+ # Normalize the camera orientation
+ return get_lookat_w2cs(ref_position[None].repeat(num_frames, 1), lookat, up)
+
+
+def normalize(x):
+ """Normalization helper function."""
+ return x / np.linalg.norm(x)
+
+
+def viewmatrix(lookdir, up, position, subtract_position=False):
+ """Construct lookat view matrix."""
+ vec2 = normalize((lookdir - position) if subtract_position else lookdir)
+ vec0 = normalize(np.cross(up, vec2))
+ vec1 = normalize(np.cross(vec2, vec0))
+ m = np.stack([vec0, vec1, vec2, position], axis=1)
+ return m
+
+
+def poses_avg(poses):
+ """New pose using average position, z-axis, and up vector of input poses."""
+ position = poses[:, :3, 3].mean(0)
+ z_axis = poses[:, :3, 2].mean(0)
+ up = poses[:, :3, 1].mean(0)
+ cam2world = viewmatrix(z_axis, up, position)
+ return cam2world
+
+
+def generate_spiral_path(
+ poses, bounds, n_frames=120, n_rots=2, zrate=0.5, endpoint=False, radii=None
+):
+ """Calculates a forward facing spiral path for rendering."""
+ # Find a reasonable 'focus depth' for this dataset as a weighted average
+ # of near and far bounds in disparity space.
+ close_depth, inf_depth = bounds.min() * 0.9, bounds.max() * 5.0
+ dt = 0.75
+ focal = 1 / ((1 - dt) / close_depth + dt / inf_depth)
+
+ # Get radii for spiral path using 90th percentile of camera positions.
+ positions = poses[:, :3, 3]
+ if radii is None:
+ radii = np.percentile(np.abs(positions), 90, 0)
+ radii = np.concatenate([radii, [1.0]])
+
+ # Generate poses for spiral path.
+ render_poses = []
+ cam2world = poses_avg(poses)
+ up = poses[:, :3, 1].mean(0)
+ for theta in np.linspace(0.0, 2.0 * np.pi * n_rots, n_frames, endpoint=endpoint):
+ t = radii * [np.cos(theta), -np.sin(theta), -np.sin(theta * zrate), 1.0]
+ position = cam2world @ t
+ lookat = cam2world @ [0, 0, -focal, 1.0]
+ z_axis = position - lookat
+ render_poses.append(viewmatrix(z_axis, up, position))
+ render_poses = np.stack(render_poses, axis=0)
+ return render_poses
+
+
+def generate_interpolated_path(
+ poses: np.ndarray,
+ n_interp: int,
+ spline_degree: int = 5,
+ smoothness: float = 0.03,
+ rot_weight: float = 0.1,
+ endpoint: bool = False,
+):
+ """Creates a smooth spline path between input keyframe camera poses.
+
+ Spline is calculated with poses in format (position, lookat-point, up-point).
+
+ Args:
+ poses: (n, 3, 4) array of input pose keyframes.
+ n_interp: returned path will have n_interp * (n - 1) total poses.
+ spline_degree: polynomial degree of B-spline.
+ smoothness: parameter for spline smoothing, 0 forces exact interpolation.
+ rot_weight: relative weighting of rotation/translation in spline solve.
+
+ Returns:
+ Array of new camera poses with shape (n_interp * (n - 1), 3, 4).
+ """
+
+ def poses_to_points(poses, dist):
+ """Converts from pose matrices to (position, lookat, up) format."""
+ pos = poses[:, :3, -1]
+ lookat = poses[:, :3, -1] - dist * poses[:, :3, 2]
+ up = poses[:, :3, -1] + dist * poses[:, :3, 1]
+ return np.stack([pos, lookat, up], 1)
+
+ def points_to_poses(points):
+ """Converts from (position, lookat, up) format to pose matrices."""
+ return np.array([viewmatrix(p - l, u - p, p) for p, l, u in points])
+
+ def interp(points, n, k, s):
+ """Runs multidimensional B-spline interpolation on the input points."""
+ sh = points.shape
+ pts = np.reshape(points, (sh[0], -1))
+ k = min(k, sh[0] - 1)
+ tck, _ = scipy.interpolate.splprep(pts.T, k=k, s=s)
+ u = np.linspace(0, 1, n, endpoint=endpoint)
+ new_points = np.array(scipy.interpolate.splev(u, tck))
+ new_points = np.reshape(new_points.T, (n, sh[1], sh[2]))
+ return new_points
+
+ points = poses_to_points(poses, dist=rot_weight)
+ new_points = interp(
+ points, n_interp * (points.shape[0] - 1), k=spline_degree, s=smoothness
+ )
+ return points_to_poses(new_points)
+
+
+def similarity_from_cameras(c2w, strict_scaling=False, center_method="focus"):
+ """
+ reference: nerf-factory
+ Get a similarity transform to normalize dataset
+ from c2w (OpenCV convention) cameras
+ :param c2w: (N, 4, 4) camera-to-world matrices
+ :return: (4, 4) similarity transform (rotation, translation, and scale)
+ """
+ t = c2w[:, :3, 3]
+ R = c2w[:, :3, :3]
+
+ # (1) Rotate the world so that y- is the up axis (matches `up_camspace` below)
+ # we estimate the up axis by averaging the camera up axes
+ ups = np.sum(R * np.array([0, -1.0, 0]), axis=-1)
+ world_up = np.mean(ups, axis=0)
+ world_up /= np.linalg.norm(world_up)
+
+ up_camspace = np.array([0.0, -1.0, 0.0])
+ c = (up_camspace * world_up).sum()
+ cross = np.cross(world_up, up_camspace)
+ skew = np.array(
+ [
+ [0.0, -cross[2], cross[1]],
+ [cross[2], 0.0, -cross[0]],
+ [-cross[1], cross[0], 0.0],
+ ]
+ )
+ if c > -1:
+ R_align = np.eye(3) + skew + (skew @ skew) * 1 / (1 + c)
+ else:
+ # In the unlikely case the original data has y+ up axis,
+ # rotate 180-deg about x axis
+ R_align = np.array([[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
+
+ # R_align = np.eye(3) # DEBUG
+ R = R_align @ R
+ fwds = np.sum(R * np.array([0, 0.0, 1.0]), axis=-1)
+ t = (R_align @ t[..., None])[..., 0]
+
+ # (2) Recenter the scene.
+ if center_method == "focus":
+ # find the closest point to the origin for each camera's center ray
+ nearest = t + (fwds * -t).sum(-1)[:, None] * fwds
+ translate = -np.median(nearest, axis=0)
+ elif center_method == "poses":
+ # use center of the camera positions
+ translate = -np.median(t, axis=0)
+ else:
+ raise ValueError(f"Unknown center_method {center_method}")
+
+ transform = np.eye(4)
+ transform[:3, 3] = translate
+ transform[:3, :3] = R_align
+
+ # (3) Rescale the scene using camera distances
+ scale_fn = np.max if strict_scaling else np.median
+ inv_scale = scale_fn(np.linalg.norm(t + translate, axis=-1))
+ if inv_scale == 0:
+ inv_scale = 1.0
+ scale = 1.0 / inv_scale
+ transform[:3, :] *= scale
+
+ return transform
+
+
+def align_principle_axes(point_cloud):
+ # Compute centroid
+ centroid = np.median(point_cloud, axis=0)
+
+ # Translate point cloud to centroid
+ translated_point_cloud = point_cloud - centroid
+
+ # Compute covariance matrix
+ covariance_matrix = np.cov(translated_point_cloud, rowvar=False)
+
+ # Compute eigenvectors and eigenvalues
+ eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)
+
+ # Sort eigenvectors by eigenvalues (descending order) so that the z-axis
+ # is the principal axis with the smallest eigenvalue.
+ sort_indices = eigenvalues.argsort()[::-1]
+ eigenvectors = eigenvectors[:, sort_indices]
+
+ # Check orientation of eigenvectors. If the determinant of the eigenvectors is
+ # negative, then we need to flip the sign of one of the eigenvectors.
+ if np.linalg.det(eigenvectors) < 0:
+ eigenvectors[:, 0] *= -1
+
+ # Create rotation matrix
+ rotation_matrix = eigenvectors.T
+
+ # Create SE(3) matrix (4x4 transformation matrix)
+ transform = np.eye(4)
+ transform[:3, :3] = rotation_matrix
+ transform[:3, 3] = -rotation_matrix @ centroid
+
+ return transform
+
+
+def transform_points(matrix, points):
+ """Transform points using an SE(3) matrix.
+
+ Args:
+ matrix: 4x4 SE(3) matrix
+ points: Nx3 array of points
+
+ Returns:
+ Nx3 array of transformed points
+ """
+ assert matrix.shape == (4, 4)
+ assert len(points.shape) == 2 and points.shape[1] == 3
+ return points @ matrix[:3, :3].T + matrix[:3, 3]
+
+
+def transform_cameras(matrix, camtoworlds):
+ """Transform cameras using an SE(3) matrix.
+
+ Args:
+ matrix: 4x4 SE(3) matrix
+ camtoworlds: Nx4x4 array of camera-to-world matrices
+
+ Returns:
+ Nx4x4 array of transformed camera-to-world matrices
+ """
+ assert matrix.shape == (4, 4)
+ assert len(camtoworlds.shape) == 3 and camtoworlds.shape[1:] == (4, 4)
+ camtoworlds = np.einsum("nij, ki -> nkj", camtoworlds, matrix)
+ scaling = np.linalg.norm(camtoworlds[:, 0, :3], axis=1)
+ camtoworlds[:, :3, :3] = camtoworlds[:, :3, :3] / scaling[:, None, None]
+ return camtoworlds
+
+
+def normalize_scene(camtoworlds, points=None, camera_center_method="focus"):
+ T1 = similarity_from_cameras(camtoworlds, center_method=camera_center_method)
+ camtoworlds = transform_cameras(T1, camtoworlds)
+ if points is not None:
+ points = transform_points(T1, points)
+ T2 = align_principle_axes(points)
+ camtoworlds = transform_cameras(T2, camtoworlds)
+ points = transform_points(T2, points)
+ return camtoworlds, points, T2 @ T1
+ else:
+ return camtoworlds, T1
diff --git a/seva/gui.py b/seva/gui.py
new file mode 100644
index 0000000000000000000000000000000000000000..90be6d16e9acdfc2d61be39d6c354a0d3688addf
--- /dev/null
+++ b/seva/gui.py
@@ -0,0 +1,975 @@
+import colorsys
+import dataclasses
+import threading
+import time
+from pathlib import Path
+
+import numpy as np
+import scipy
+import splines
+import splines.quaternion
+import torch
+import viser
+import viser.transforms as vt
+
+from seva.geometry import get_preset_pose_fov
+
+
+@dataclasses.dataclass
+class Keyframe(object):
+ position: np.ndarray
+ wxyz: np.ndarray
+ override_fov_enabled: bool
+ override_fov_rad: float
+ aspect: float
+ override_transition_enabled: bool
+ override_transition_sec: float | None
+
+ @staticmethod
+ def from_camera(camera: viser.CameraHandle, aspect: float) -> "Keyframe":
+ return Keyframe(
+ camera.position,
+ camera.wxyz,
+ override_fov_enabled=False,
+ override_fov_rad=camera.fov,
+ aspect=aspect,
+ override_transition_enabled=False,
+ override_transition_sec=None,
+ )
+
+ @staticmethod
+ def from_se3(se3: vt.SE3, fov: float, aspect: float) -> "Keyframe":
+ return Keyframe(
+ se3.translation(),
+ se3.rotation().wxyz,
+ override_fov_enabled=False,
+ override_fov_rad=fov,
+ aspect=aspect,
+ override_transition_enabled=False,
+ override_transition_sec=None,
+ )
+
+
+class CameraTrajectory(object):
+ def __init__(
+ self,
+ server: viser.ViserServer,
+ duration_element: viser.GuiInputHandle[float],
+ scene_scale: float,
+ scene_node_prefix: str = "/",
+ ):
+ self._server = server
+ self._keyframes: dict[int, tuple[Keyframe, viser.CameraFrustumHandle]] = {}
+ self._keyframe_counter: int = 0
+ self._spline_nodes: list[viser.SceneNodeHandle] = []
+ self._camera_edit_panel: viser.Gui3dContainerHandle | None = None
+
+ self._orientation_spline: splines.quaternion.KochanekBartels | None = None
+ self._position_spline: splines.KochanekBartels | None = None
+ self._fov_spline: splines.KochanekBartels | None = None
+
+ self._keyframes_visible: bool = True
+
+ self._duration_element = duration_element
+ self._scene_node_prefix = scene_node_prefix
+
+ self.scene_scale = scene_scale
+ # These parameters should be overridden externally.
+ self.loop: bool = False
+ self.framerate: float = 30.0
+ self.tension: float = 0.0 # Tension / alpha term.
+ self.default_fov: float = 0.0
+ self.default_transition_sec: float = 0.0
+ self.show_spline: bool = True
+
+ def set_keyframes_visible(self, visible: bool) -> None:
+ self._keyframes_visible = visible
+ for keyframe in self._keyframes.values():
+ keyframe[1].visible = visible
+
+ def add_camera(self, keyframe: Keyframe, keyframe_index: int | None = None) -> None:
+ """Add a new camera, or replace an old one if `keyframe_index` is passed in."""
+ server = self._server
+
+ # Add a keyframe if we aren't replacing an existing one.
+ if keyframe_index is None:
+ keyframe_index = self._keyframe_counter
+ self._keyframe_counter += 1
+
+ print(
+ f"{keyframe.wxyz=} {keyframe.position=} {keyframe_index=} {keyframe.aspect=}"
+ )
+ frustum_handle = server.scene.add_camera_frustum(
+ str(Path(self._scene_node_prefix) / f"cameras/{keyframe_index}"),
+ fov=(
+ keyframe.override_fov_rad
+ if keyframe.override_fov_enabled
+ else self.default_fov
+ ),
+ aspect=keyframe.aspect,
+ scale=0.1 * self.scene_scale,
+ color=(200, 10, 30),
+ wxyz=keyframe.wxyz,
+ position=keyframe.position,
+ visible=self._keyframes_visible,
+ )
+ self._server.scene.add_icosphere(
+ str(Path(self._scene_node_prefix) / f"cameras/{keyframe_index}/sphere"),
+ radius=0.03,
+ color=(200, 10, 30),
+ )
+
+ @frustum_handle.on_click
+ def _(_) -> None:
+ if self._camera_edit_panel is not None:
+ self._camera_edit_panel.remove()
+ self._camera_edit_panel = None
+
+ with server.scene.add_3d_gui_container(
+ "/camera_edit_panel",
+ position=keyframe.position,
+ ) as camera_edit_panel:
+ self._camera_edit_panel = camera_edit_panel
+ override_fov = server.gui.add_checkbox(
+ "Override FOV", initial_value=keyframe.override_fov_enabled
+ )
+ override_fov_degrees = server.gui.add_slider(
+ "Override FOV (degrees)",
+ 5.0,
+ 175.0,
+ step=0.1,
+ initial_value=keyframe.override_fov_rad * 180.0 / np.pi,
+ disabled=not keyframe.override_fov_enabled,
+ )
+ delete_button = server.gui.add_button(
+ "Delete", color="red", icon=viser.Icon.TRASH
+ )
+ go_to_button = server.gui.add_button("Go to")
+ close_button = server.gui.add_button("Close")
+
+ @override_fov.on_update
+ def _(_) -> None:
+ keyframe.override_fov_enabled = override_fov.value
+ override_fov_degrees.disabled = not override_fov.value
+ self.add_camera(keyframe, keyframe_index)
+
+ @override_fov_degrees.on_update
+ def _(_) -> None:
+ keyframe.override_fov_rad = override_fov_degrees.value / 180.0 * np.pi
+ self.add_camera(keyframe, keyframe_index)
+
+ @delete_button.on_click
+ def _(event: viser.GuiEvent) -> None:
+ assert event.client is not None
+ with event.client.gui.add_modal("Confirm") as modal:
+ event.client.gui.add_markdown("Delete keyframe?")
+ confirm_button = event.client.gui.add_button(
+ "Yes", color="red", icon=viser.Icon.TRASH
+ )
+ exit_button = event.client.gui.add_button("Cancel")
+
+ @confirm_button.on_click
+ def _(_) -> None:
+ assert camera_edit_panel is not None
+
+ keyframe_id = None
+ for i, keyframe_tuple in self._keyframes.items():
+ if keyframe_tuple[1] is frustum_handle:
+ keyframe_id = i
+ break
+ assert keyframe_id is not None
+
+ self._keyframes.pop(keyframe_id)
+ frustum_handle.remove()
+ camera_edit_panel.remove()
+ self._camera_edit_panel = None
+ modal.close()
+ self.update_spline()
+
+ @exit_button.on_click
+ def _(_) -> None:
+ modal.close()
+
+ @go_to_button.on_click
+ def _(event: viser.GuiEvent) -> None:
+ assert event.client is not None
+ client = event.client
+ T_world_current = vt.SE3.from_rotation_and_translation(
+ vt.SO3(client.camera.wxyz), client.camera.position
+ )
+ T_world_target = vt.SE3.from_rotation_and_translation(
+ vt.SO3(keyframe.wxyz), keyframe.position
+ ) @ vt.SE3.from_translation(np.array([0.0, 0.0, -0.5]))
+
+ T_current_target = T_world_current.inverse() @ T_world_target
+
+ for j in range(10):
+ T_world_set = T_world_current @ vt.SE3.exp(
+ T_current_target.log() * j / 9.0
+ )
+
+ # Important bit: we atomically set both the orientation and
+ # the position of the camera.
+ with client.atomic():
+ client.camera.wxyz = T_world_set.rotation().wxyz
+ client.camera.position = T_world_set.translation()
+ time.sleep(1.0 / 30.0)
+
+ @close_button.on_click
+ def _(_) -> None:
+ assert camera_edit_panel is not None
+ camera_edit_panel.remove()
+ self._camera_edit_panel = None
+
+ self._keyframes[keyframe_index] = (keyframe, frustum_handle)
+
+ def update_aspect(self, aspect: float) -> None:
+ for keyframe_index, frame in self._keyframes.items():
+ frame = dataclasses.replace(frame[0], aspect=aspect)
+ self.add_camera(frame, keyframe_index=keyframe_index)
+
+ def get_aspect(self) -> float:
+ """Get W/H aspect ratio, which is shared across all keyframes."""
+ assert len(self._keyframes) > 0
+ return next(iter(self._keyframes.values()))[0].aspect
+
+ def reset(self) -> None:
+ for frame in self._keyframes.values():
+ print(f"removing {frame[1]}")
+ frame[1].remove()
+ self._keyframes.clear()
+ self.update_spline()
+ print("camera traj reset")
+
+ def spline_t_from_t_sec(self, time: np.ndarray) -> np.ndarray:
+ """From a time value in seconds, compute a t value for our geometric
+ spline interpolation. An increment of 1 for the latter will move the
+ camera forward by one keyframe.
+
+ We use a PCHIP spline here to guarantee monotonicity.
+ """
+ transition_times_cumsum = self.compute_transition_times_cumsum()
+ spline_indices = np.arange(transition_times_cumsum.shape[0])
+
+ if self.loop:
+ # In the case of a loop, we pad the spline to match the start/end
+ # slopes.
+ interpolator = scipy.interpolate.PchipInterpolator(
+ x=np.concatenate(
+ [
+ [-(transition_times_cumsum[-1] - transition_times_cumsum[-2])],
+ transition_times_cumsum,
+ transition_times_cumsum[-1:] + transition_times_cumsum[1:2],
+ ],
+ axis=0,
+ ),
+ y=np.concatenate(
+ [[-1], spline_indices, [spline_indices[-1] + 1]], # type: ignore
+ axis=0,
+ ),
+ )
+ else:
+ interpolator = scipy.interpolate.PchipInterpolator(
+ x=transition_times_cumsum, y=spline_indices
+ )
+
+ # Clip to account for floating point error.
+ return np.clip(interpolator(time), 0, spline_indices[-1])
+
+ def interpolate_pose_and_fov_rad(
+ self, normalized_t: float
+ ) -> tuple[vt.SE3, float] | None:
+ if len(self._keyframes) < 2:
+ return None
+
+ self._fov_spline = splines.KochanekBartels(
+ [
+ (
+ keyframe[0].override_fov_rad
+ if keyframe[0].override_fov_enabled
+ else self.default_fov
+ )
+ for keyframe in self._keyframes.values()
+ ],
+ tcb=(self.tension, 0.0, 0.0),
+ endconditions="closed" if self.loop else "natural",
+ )
+
+ assert self._orientation_spline is not None
+ assert self._position_spline is not None
+ assert self._fov_spline is not None
+
+ max_t = self.compute_duration()
+ t = max_t * normalized_t
+ spline_t = float(self.spline_t_from_t_sec(np.array(t)))
+
+ quat = self._orientation_spline.evaluate(spline_t)
+ assert isinstance(quat, splines.quaternion.UnitQuaternion)
+ return (
+ vt.SE3.from_rotation_and_translation(
+ vt.SO3(np.array([quat.scalar, *quat.vector])),
+ self._position_spline.evaluate(spline_t),
+ ),
+ float(self._fov_spline.evaluate(spline_t)),
+ )
+
+ def update_spline(self) -> None:
+ num_frames = int(self.compute_duration() * self.framerate)
+ keyframes = list(self._keyframes.values())
+
+ if num_frames <= 0 or not self.show_spline or len(keyframes) < 2:
+ for node in self._spline_nodes:
+ node.remove()
+ self._spline_nodes.clear()
+ return
+
+ transition_times_cumsum = self.compute_transition_times_cumsum()
+
+ self._orientation_spline = splines.quaternion.KochanekBartels(
+ [
+ splines.quaternion.UnitQuaternion.from_unit_xyzw(
+ np.roll(keyframe[0].wxyz, shift=-1)
+ )
+ for keyframe in keyframes
+ ],
+ tcb=(self.tension, 0.0, 0.0),
+ endconditions="closed" if self.loop else "natural",
+ )
+ self._position_spline = splines.KochanekBartels(
+ [keyframe[0].position for keyframe in keyframes],
+ tcb=(self.tension, 0.0, 0.0),
+ endconditions="closed" if self.loop else "natural",
+ )
+
+ # Update visualized spline.
+ points_array = self._position_spline.evaluate(
+ self.spline_t_from_t_sec(
+ np.linspace(0, transition_times_cumsum[-1], num_frames)
+ )
+ )
+ colors_array = np.array(
+ [
+ colorsys.hls_to_rgb(h, 0.5, 1.0)
+ for h in np.linspace(0.0, 1.0, len(points_array))
+ ]
+ )
+
+ # Clear prior spline nodes.
+ for node in self._spline_nodes:
+ node.remove()
+ self._spline_nodes.clear()
+
+ self._spline_nodes.append(
+ self._server.scene.add_spline_catmull_rom(
+ str(Path(self._scene_node_prefix) / "camera_spline"),
+ positions=points_array,
+ color=(220, 220, 220),
+ closed=self.loop,
+ line_width=1.0,
+ segments=points_array.shape[0] + 1,
+ )
+ )
+ self._spline_nodes.append(
+ self._server.scene.add_point_cloud(
+ str(Path(self._scene_node_prefix) / "camera_spline/points"),
+ points=points_array,
+ colors=colors_array,
+ point_size=0.04,
+ )
+ )
+
+ def make_transition_handle(i: int) -> None:
+ assert self._position_spline is not None
+ transition_pos = self._position_spline.evaluate(
+ float(
+ self.spline_t_from_t_sec(
+ (transition_times_cumsum[i] + transition_times_cumsum[i + 1])
+ / 2.0,
+ )
+ )
+ )
+ transition_sphere = self._server.scene.add_icosphere(
+ str(Path(self._scene_node_prefix) / f"camera_spline/transition_{i}"),
+ radius=0.04,
+ color=(255, 0, 0),
+ position=transition_pos,
+ )
+ self._spline_nodes.append(transition_sphere)
+
+ @transition_sphere.on_click
+ def _(_) -> None:
+ server = self._server
+
+ if self._camera_edit_panel is not None:
+ self._camera_edit_panel.remove()
+ self._camera_edit_panel = None
+
+ keyframe_index = (i + 1) % len(self._keyframes)
+ keyframe = keyframes[keyframe_index][0]
+
+ with server.scene.add_3d_gui_container(
+ "/camera_edit_panel",
+ position=transition_pos,
+ ) as camera_edit_panel:
+ self._camera_edit_panel = camera_edit_panel
+ override_transition_enabled = server.gui.add_checkbox(
+ "Override transition",
+ initial_value=keyframe.override_transition_enabled,
+ )
+ override_transition_sec = server.gui.add_number(
+ "Override transition (sec)",
+ initial_value=(
+ keyframe.override_transition_sec
+ if keyframe.override_transition_sec is not None
+ else self.default_transition_sec
+ ),
+ min=0.001,
+ max=30.0,
+ step=0.001,
+ disabled=not override_transition_enabled.value,
+ )
+ close_button = server.gui.add_button("Close")
+
+ @override_transition_enabled.on_update
+ def _(_) -> None:
+ keyframe.override_transition_enabled = (
+ override_transition_enabled.value
+ )
+ override_transition_sec.disabled = (
+ not override_transition_enabled.value
+ )
+ self._duration_element.value = self.compute_duration()
+
+ @override_transition_sec.on_update
+ def _(_) -> None:
+ keyframe.override_transition_sec = override_transition_sec.value
+ self._duration_element.value = self.compute_duration()
+
+ @close_button.on_click
+ def _(_) -> None:
+ assert camera_edit_panel is not None
+ camera_edit_panel.remove()
+ self._camera_edit_panel = None
+
+ (num_transitions_plus_1,) = transition_times_cumsum.shape
+ for i in range(num_transitions_plus_1 - 1):
+ make_transition_handle(i)
+
+ def compute_duration(self) -> float:
+ """Compute the total duration of the trajectory."""
+ total = 0.0
+ for i, (keyframe, frustum) in enumerate(self._keyframes.values()):
+ if i == 0 and not self.loop:
+ continue
+ del frustum
+ total += (
+ keyframe.override_transition_sec
+ if keyframe.override_transition_enabled
+ and keyframe.override_transition_sec is not None
+ else self.default_transition_sec
+ )
+ return total
+
+ def compute_transition_times_cumsum(self) -> np.ndarray:
+ """Compute the cumulative transition times (in seconds) along the trajectory."""
+ total = 0.0
+ out = [0.0]
+ for i, (keyframe, frustum) in enumerate(self._keyframes.values()):
+ if i == 0:
+ continue
+ del frustum
+ total += (
+ keyframe.override_transition_sec
+ if keyframe.override_transition_enabled
+ and keyframe.override_transition_sec is not None
+ else self.default_transition_sec
+ )
+ out.append(total)
+
+ if self.loop:
+ keyframe = next(iter(self._keyframes.values()))[0]
+ total += (
+ keyframe.override_transition_sec
+ if keyframe.override_transition_enabled
+ and keyframe.override_transition_sec is not None
+ else self.default_transition_sec
+ )
+ out.append(total)
+
+ return np.array(out)
+
+
+@dataclasses.dataclass
+class GuiState:
+ preview_render: bool
+ preview_fov: float
+ preview_aspect: float
+ camera_traj_list: list | None
+ active_input_index: int
+
+
+def define_gui(
+ server: viser.ViserServer,
+ init_fov: float = 75.0,
+ img_wh: tuple[int, int] = (576, 576),
+ **kwargs,
+) -> GuiState:
+ gui_state = GuiState(
+ preview_render=False,
+ preview_fov=0.0,
+ preview_aspect=1.0,
+ camera_traj_list=None,
+ active_input_index=0,
+ )
+
+ with server.gui.add_folder(
+ "Preset camera trajectories", order=99, expand_by_default=False
+ ):
+ preset_traj_dropdown = server.gui.add_dropdown(
+ "Options",
+ [
+ "orbit",
+ "spiral",
+ "lemniscate",
+ "zoom-out",
+ "dolly zoom-out",
+ ],
+ initial_value="orbit",
+ hint="Select a preset camera trajectory.",
+ )
+ preset_duration_num = server.gui.add_number(
+ "Duration (sec)",
+ min=1.0,
+ max=60.0,
+ step=0.5,
+ initial_value=2.0,
+ )
+ preset_submit_button = server.gui.add_button(
+ "Submit",
+ icon=viser.Icon.PICK,
+ hint="Replace the current keyframes with the selected preset trajectory.",
+ )
+
+ @preset_submit_button.on_click
+ def _(event: viser.GuiEvent) -> None:
+ camera_traj.reset()
+ gui_state.camera_traj_list = None
+
+ duration = preset_duration_num.value
+ fps = framerate_number.value
+ num_frames = max(1, int(duration * fps)) # Avoid division by zero below.
+ transition_sec = duration / num_frames
+ transition_sec_number.value = transition_sec
+ assert event.client_id is not None
+ transition_sec_number.disabled = True
+ loop_checkbox.disabled = True
+ add_keyframe_button.disabled = True
+
+ camera = server.get_clients()[event.client_id].camera
+ start_w2c = torch.linalg.inv(
+ torch.as_tensor(
+ vt.SE3.from_rotation_and_translation(
+ vt.SO3(camera.wxyz), camera.position
+ ).as_matrix(),
+ dtype=torch.float32,
+ )
+ )
+ look_at = torch.as_tensor(camera.look_at, dtype=torch.float32)
+ up_direction = torch.as_tensor(camera.up_direction, dtype=torch.float32)
+ poses, fovs = get_preset_pose_fov(
+ option=preset_traj_dropdown.value, # type: ignore
+ num_frames=num_frames,
+ start_w2c=start_w2c,
+ look_at=look_at,
+ up_direction=up_direction,
+ fov=camera.fov,
+ )
+ assert poses is not None and fovs is not None
+ for pose, fov in zip(poses, fovs):
+ camera_traj.add_camera(
+ Keyframe.from_se3(
+ vt.SE3.from_matrix(pose),
+ fov=fov,
+ aspect=img_wh[0] / img_wh[1],
+ )
+ )
+
+ duration_number.value = camera_traj.compute_duration()
+ camera_traj.update_spline()
+
+ with server.gui.add_folder("Advanced", expand_by_default=False, order=100):
+ transition_sec_number = server.gui.add_number(
+ "Transition (sec)",
+ min=0.001,
+ max=30.0,
+ step=0.001,
+ initial_value=1.5,
+ hint="Time in seconds between each keyframe, which can also be overridden on a per-transition basis.",
+ )
+ framerate_number = server.gui.add_number(
+ "FPS", min=0.1, max=240.0, step=1e-2, initial_value=30.0
+ )
+ framerate_buttons = server.gui.add_button_group("", ("24", "30", "60"))
+ duration_number = server.gui.add_number(
+ "Duration (sec)",
+ min=0.0,
+ max=1e8,
+ step=0.001,
+ initial_value=0.0,
+ disabled=True,
+ )
+
+ @framerate_buttons.on_click
+ def _(_) -> None:
+ framerate_number.value = float(framerate_buttons.value)
+
+ fov_degree_slider = server.gui.add_slider(
+ "FOV",
+ initial_value=init_fov,
+ min=0.1,
+ max=175.0,
+ step=0.01,
+ hint="Field-of-view for rendering, which can also be overridden on a per-keyframe basis.",
+ )
+
+ @fov_degree_slider.on_update
+ def _(_) -> None:
+ fov_radians = fov_degree_slider.value / 180.0 * np.pi
+ for client in server.get_clients().values():
+ client.camera.fov = fov_radians
+ camera_traj.default_fov = fov_radians
+
+ # Updating the aspect ratio will also re-render the camera frustums.
+ # Could rethink this.
+ camera_traj.update_aspect(img_wh[0] / img_wh[1])
+ compute_and_update_preview_camera_state()
+
+ scene_node_prefix = "/render_assets"
+ base_scene_node = server.scene.add_frame(scene_node_prefix, show_axes=False)
+ add_keyframe_button = server.gui.add_button(
+ "Add keyframe",
+ icon=viser.Icon.PLUS,
+ hint="Add a new keyframe at the current pose.",
+ )
+
+ @add_keyframe_button.on_click
+ def _(event: viser.GuiEvent) -> None:
+ assert event.client_id is not None
+ camera = server.get_clients()[event.client_id].camera
+ pose = vt.SE3.from_rotation_and_translation(
+ vt.SO3(camera.wxyz), camera.position
+ )
+ print(f"client {event.client_id} at {camera.position} {camera.wxyz}")
+ print(f"camera pose {pose.as_matrix()}")
+
+ # Add this camera to the trajectory.
+ camera_traj.add_camera(
+ Keyframe.from_camera(
+ camera,
+ aspect=img_wh[0] / img_wh[1],
+ ),
+ )
+ duration_number.value = camera_traj.compute_duration()
+ camera_traj.update_spline()
+
+ clear_keyframes_button = server.gui.add_button(
+ "Clear keyframes",
+ icon=viser.Icon.TRASH,
+ hint="Remove all keyframes from the render trajectory.",
+ )
+
+ @clear_keyframes_button.on_click
+ def _(event: viser.GuiEvent) -> None:
+ assert event.client_id is not None
+ client = server.get_clients()[event.client_id]
+ with client.atomic(), client.gui.add_modal("Confirm") as modal:
+ client.gui.add_markdown("Clear all keyframes?")
+ confirm_button = client.gui.add_button(
+ "Yes", color="red", icon=viser.Icon.TRASH
+ )
+ exit_button = client.gui.add_button("Cancel")
+
+ @confirm_button.on_click
+ def _(_) -> None:
+ camera_traj.reset()
+ modal.close()
+
+ duration_number.value = camera_traj.compute_duration()
+ add_keyframe_button.disabled = False
+ transition_sec_number.disabled = False
+ transition_sec_number.value = 1.5
+ loop_checkbox.disabled = False
+
+ nonlocal gui_state
+ gui_state.camera_traj_list = None
+
+ @exit_button.on_click
+ def _(_) -> None:
+ modal.close()
+
+ play_button = server.gui.add_button("Play", icon=viser.Icon.PLAYER_PLAY)
+ pause_button = server.gui.add_button(
+ "Pause", icon=viser.Icon.PLAYER_PAUSE, visible=False
+ )
+
+ # Poll the play button to see if we should be playing endlessly.
+ def play() -> None:
+ while True:
+ while not play_button.visible:
+ max_frame = int(framerate_number.value * duration_number.value)
+ if max_frame > 0:
+ assert preview_frame_slider is not None
+ preview_frame_slider.value = (
+ preview_frame_slider.value + 1
+ ) % max_frame
+ time.sleep(1.0 / framerate_number.value)
+ time.sleep(0.1)
+
+ threading.Thread(target=play).start()
+
+ # Play the camera trajectory when the play button is pressed.
+ @play_button.on_click
+ def _(_) -> None:
+ play_button.visible = False
+ pause_button.visible = True
+
+ # Pause the camera trajectory when the pause button is pressed.
+ @pause_button.on_click
+ def _(_) -> None:
+ play_button.visible = True
+ pause_button.visible = False
+
+ preview_render_button = server.gui.add_button(
+ "Preview render",
+ hint="Show a preview of the render in the viewport.",
+ icon=viser.Icon.CAMERA_CHECK,
+ )
+ preview_render_stop_button = server.gui.add_button(
+ "Exit render preview",
+ color="red",
+ icon=viser.Icon.CAMERA_CANCEL,
+ visible=False,
+ )
+
+ @preview_render_button.on_click
+ def _(_) -> None:
+ gui_state.preview_render = True
+ preview_render_button.visible = False
+ preview_render_stop_button.visible = True
+ play_button.visible = False
+ pause_button.visible = True
+ preset_submit_button.disabled = True
+
+ maybe_pose_and_fov_rad = compute_and_update_preview_camera_state()
+ if maybe_pose_and_fov_rad is None:
+ remove_preview_camera()
+ return
+ pose, fov = maybe_pose_and_fov_rad
+ del fov
+
+ # Hide all render assets when we're previewing the render.
+ nonlocal base_scene_node
+ base_scene_node.visible = False
+
+ # Back up and then set camera poses.
+ for client in server.get_clients().values():
+ camera_pose_backup_from_id[client.client_id] = (
+ client.camera.position,
+ client.camera.look_at,
+ client.camera.up_direction,
+ )
+ with client.atomic():
+ client.camera.wxyz = pose.rotation().wxyz
+ client.camera.position = pose.translation()
+
+ def stop_preview_render() -> None:
+ gui_state.preview_render = False
+ preview_render_button.visible = True
+ preview_render_stop_button.visible = False
+ play_button.visible = True
+ pause_button.visible = False
+ preset_submit_button.disabled = False
+
+ # Revert camera poses.
+ for client in server.get_clients().values():
+ if client.client_id not in camera_pose_backup_from_id:
+ continue
+ cam_position, cam_look_at, cam_up = camera_pose_backup_from_id.pop(
+ client.client_id
+ )
+ with client.atomic():
+ client.camera.position = cam_position
+ client.camera.look_at = cam_look_at
+ client.camera.up_direction = cam_up
+ client.flush()
+
+ # Un-hide render assets.
+ nonlocal base_scene_node
+ base_scene_node.visible = True
+ remove_preview_camera()
+
+ @preview_render_stop_button.on_click
+ def _(_) -> None:
+ stop_preview_render()
+
+ def get_max_frame_index() -> int:
+ return max(1, int(framerate_number.value * duration_number.value) - 1)
+
+ def add_preview_frame_slider() -> viser.GuiInputHandle[int] | None:
+ """Helper for creating the current frame # slider. This is removed and
+ re-added anytime the `max` value changes."""
+
+ preview_frame_slider = server.gui.add_slider(
+ "Preview frame",
+ min=0,
+ max=get_max_frame_index(),
+ step=1,
+ initial_value=0,
+ order=set_traj_button.order + 0.01,
+ disabled=get_max_frame_index() == 1,
+ )
+ play_button.disabled = preview_frame_slider.disabled
+ preview_render_button.disabled = preview_frame_slider.disabled
+ set_traj_button.disabled = preview_frame_slider.disabled
+
+ @preview_frame_slider.on_update
+ def _(_) -> None:
+ nonlocal preview_camera_handle
+ maybe_pose_and_fov_rad = compute_and_update_preview_camera_state()
+ if maybe_pose_and_fov_rad is None:
+ return
+ pose, fov_rad = maybe_pose_and_fov_rad
+
+ preview_camera_handle = server.scene.add_camera_frustum(
+ str(Path(scene_node_prefix) / "preview_camera"),
+ fov=fov_rad,
+ aspect=img_wh[0] / img_wh[1],
+ scale=0.35,
+ wxyz=pose.rotation().wxyz,
+ position=pose.translation(),
+ color=(10, 200, 30),
+ )
+ if gui_state.preview_render:
+ for client in server.get_clients().values():
+ with client.atomic():
+ client.camera.wxyz = pose.rotation().wxyz
+ client.camera.position = pose.translation()
+
+ return preview_frame_slider
+
+ set_traj_button = server.gui.add_button(
+ "Set camera trajectory",
+ color="green",
+ icon=viser.Icon.CHECK,
+ hint="Save the camera trajectory for rendering.",
+ )
+
+ @set_traj_button.on_click
+ def _(event: viser.GuiEvent) -> None:
+ assert event.client is not None
+ num_frames = int(framerate_number.value * duration_number.value)
+
+ def get_intrinsics(W, H, fov_rad):
+ focal = 0.5 * H / np.tan(0.5 * fov_rad)
+ return np.array(
+ [[focal, 0.0, 0.5 * W], [0.0, focal, 0.5 * H], [0.0, 0.0, 1.0]]
+ )
+
+ camera_traj_list = []
+ for i in range(num_frames):
+ maybe_pose_and_fov_rad = camera_traj.interpolate_pose_and_fov_rad(
+ i / num_frames
+ )
+ if maybe_pose_and_fov_rad is None:
+ return
+ pose, fov_rad = maybe_pose_and_fov_rad
+ H = img_wh[1]
+ W = img_wh[0]
+ K = get_intrinsics(W, H, fov_rad)
+ w2c = pose.inverse().as_matrix()
+ camera_traj_list.append(
+ {
+ "w2c": w2c.flatten().tolist(),
+ "K": K.flatten().tolist(),
+ "img_wh": (W, H),
+ }
+ )
+ nonlocal gui_state
+ gui_state.camera_traj_list = camera_traj_list
+ print(f"Get camera_traj_list: {gui_state.camera_traj_list}")
+
+ stop_preview_render()
+
+ preview_frame_slider = add_preview_frame_slider()
+
+ loop_checkbox = server.gui.add_checkbox(
+ "Loop", False, hint="Add a segment between the first and last keyframes."
+ )
+
+ @loop_checkbox.on_update
+ def _(_) -> None:
+ camera_traj.loop = loop_checkbox.value
+ duration_number.value = camera_traj.compute_duration()
+
+ @transition_sec_number.on_update
+ def _(_) -> None:
+ camera_traj.default_transition_sec = transition_sec_number.value
+ duration_number.value = camera_traj.compute_duration()
+
+ preview_camera_handle: viser.SceneNodeHandle | None = None
+
+ def remove_preview_camera() -> None:
+ nonlocal preview_camera_handle
+ if preview_camera_handle is not None:
+ preview_camera_handle.remove()
+ preview_camera_handle = None
+
+ def compute_and_update_preview_camera_state() -> tuple[vt.SE3, float] | None:
+ """Update the render tab state with the current preview camera pose.
+ Returns current camera pose + FOV if available."""
+
+ if preview_frame_slider is None:
+ return None
+ maybe_pose_and_fov_rad = camera_traj.interpolate_pose_and_fov_rad(
+ preview_frame_slider.value / get_max_frame_index()
+ )
+ if maybe_pose_and_fov_rad is None:
+ remove_preview_camera()
+ return None
+ pose, fov_rad = maybe_pose_and_fov_rad
+ gui_state.preview_fov = fov_rad
+ gui_state.preview_aspect = camera_traj.get_aspect()
+ return pose, fov_rad
+
+ # Back up each client's camera pose when a render preview starts; it is restored when the preview stops.
+ camera_pose_backup_from_id: dict[int, tuple] = {}
+
+ # Update the # of frames.
+ @duration_number.on_update
+ @framerate_number.on_update
+ def _(_) -> None:
+ remove_preview_camera() # Will be re-added when slider is updated.
+
+ nonlocal preview_frame_slider
+ old = preview_frame_slider
+ assert old is not None
+
+ preview_frame_slider = add_preview_frame_slider()
+ if preview_frame_slider is not None:
+ old.remove()
+ else:
+ preview_frame_slider = old
+
+ camera_traj.framerate = framerate_number.value
+ camera_traj.update_spline()
+
+ camera_traj = CameraTrajectory(
+ server,
+ duration_number,
+ scene_node_prefix=scene_node_prefix,
+ **kwargs,
+ )
+ camera_traj.default_fov = fov_degree_slider.value / 180.0 * np.pi
+ camera_traj.default_transition_sec = transition_sec_number.value
+
+ return gui_state
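Review note: the `get_intrinsics` helper in the hunk above builds a pinhole matrix from a field of view. Because the focal length is computed from `H`, the FOV is effectively treated as vertical, and the same focal length is reused for both axes. A minimal standalone sketch of that relation follows; the function name `intrinsics_from_vertical_fov` and the 576x576 example size are illustrative assumptions, not part of the patch.

```python
import math


def intrinsics_from_vertical_fov(
    width: int, height: int, fov_rad: float
) -> list[list[float]]:
    """Pinhole K with the principal point at the image center.

    Hypothetical helper: same formula as `get_intrinsics` in the patch,
    written without NumPy so it can be run on its own.
    """
    # Half the image height subtends half the vertical field of view.
    focal = 0.5 * height / math.tan(0.5 * fov_rad)
    return [
        [focal, 0.0, 0.5 * width],
        [0.0, focal, 0.5 * height],
        [0.0, 0.0, 1.0],
    ]


# A 90 degree vertical FOV gives a focal length of half the image height.
K = intrinsics_from_vertical_fov(576, 576, math.radians(90.0))
```

Since `K` depends only on `img_wh` and the interpolated FOV, it changes between frames only when keyframes carry different FOV values.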
diff --git a/seva/model.py b/seva/model.py
new file mode 100644
index 0000000000000000000000000000000000000000..c5d719c774be3154419f77c731b3cc0743245bba
--- /dev/null
+++ b/seva/model.py
@@ -0,0 +1,234 @@
+from dataclasses import dataclass, field
+
+import torch
+import torch.nn as nn
+
+from seva.modules.layers import (
+ Downsample,
+ GroupNorm32,
+ ResBlock,
+ TimestepEmbedSequential,
+ Upsample,
+ timestep_embedding,
+)
+from seva.modules.transformer import MultiviewTransformer
+
+
+@dataclass
+class SevaParams(object):
+ in_channels: int = 11
+ model_channels: int = 320
+ out_channels: int = 4
+ num_frames: int = 21
+ num_res_blocks: int = 2
+ attention_resolutions: list[int] = field(default_factory=lambda: [4, 2, 1])
+ channel_mult: list[int] = field(default_factory=lambda: [1, 2, 4, 4])
+ num_head_channels: int = 64
+ transformer_depth: list[int] = field(default_factory=lambda: [1, 1, 1, 1])
+ context_dim: int = 1024
+ dense_in_channels: int = 6
+ dropout: float = 0.0
+ unflatten_names: list[str] = field(
+ default_factory=lambda: ["middle_ds8", "output_ds4", "output_ds2"]
+ )
+
+ def __post_init__(self):
+ assert len(self.channel_mult) == len(self.transformer_depth)
+
+
+class Seva(nn.Module):
+ def __init__(self, params: SevaParams) -> None:
+ super().__init__()
+ self.params = params
+ self.model_channels = params.model_channels
+ self.out_channels = params.out_channels
+ self.num_head_channels = params.num_head_channels
+
+ time_embed_dim = params.model_channels * 4
+ self.time_embed = nn.Sequential(
+ nn.Linear(params.model_channels, time_embed_dim),
+ nn.SiLU(),
+ nn.Linear(time_embed_dim, time_embed_dim),
+ )
+
+ self.input_blocks = nn.ModuleList(
+ [
+ TimestepEmbedSequential(
+ nn.Conv2d(params.in_channels, params.model_channels, 3, padding=1)
+ )
+ ]
+ )
+ self._feature_size = params.model_channels
+ input_block_chans = [params.model_channels]
+ ch = params.model_channels
+ ds = 1
+ for level, mult in enumerate(params.channel_mult):
+ for _ in range(params.num_res_blocks):
+ input_layers: list[ResBlock | MultiviewTransformer | Downsample] = [
+ ResBlock(
+ channels=ch,
+ emb_channels=time_embed_dim,
+ out_channels=mult * params.model_channels,
+ dense_in_channels=params.dense_in_channels,
+ dropout=params.dropout,
+ )
+ ]
+ ch = mult * params.model_channels
+ if ds in params.attention_resolutions:
+ num_heads = ch // params.num_head_channels
+ dim_head = params.num_head_channels
+ input_layers.append(
+ MultiviewTransformer(
+ ch,
+ num_heads,
+ dim_head,
+ name=f"input_ds{ds}",
+ depth=params.transformer_depth[level],
+ context_dim=params.context_dim,
+ unflatten_names=params.unflatten_names,
+ )
+ )
+ self.input_blocks.append(TimestepEmbedSequential(*input_layers))
+ self._feature_size += ch
+ input_block_chans.append(ch)
+ if level != len(params.channel_mult) - 1:
+ ds *= 2
+ out_ch = ch
+ self.input_blocks.append(
+ TimestepEmbedSequential(Downsample(ch, out_channels=out_ch))
+ )
+ ch = out_ch
+ input_block_chans.append(ch)
+ self._feature_size += ch
+
+ num_heads = ch // params.num_head_channels
+ dim_head = params.num_head_channels
+
+ self.middle_block = TimestepEmbedSequential(
+ ResBlock(
+ channels=ch,
+ emb_channels=time_embed_dim,
+ out_channels=None,
+ dense_in_channels=params.dense_in_channels,
+ dropout=params.dropout,
+ ),
+ MultiviewTransformer(
+ ch,
+ num_heads,
+ dim_head,
+ name=f"middle_ds{ds}",
+ depth=params.transformer_depth[-1],
+ context_dim=params.context_dim,
+ unflatten_names=params.unflatten_names,
+ ),
+ ResBlock(
+ channels=ch,
+ emb_channels=time_embed_dim,
+ out_channels=None,
+ dense_in_channels=params.dense_in_channels,
+ dropout=params.dropout,
+ ),
+ )
+ self._feature_size += ch
+
+ self.output_blocks = nn.ModuleList([])
+ for level, mult in list(enumerate(params.channel_mult))[::-1]:
+ for i in range(params.num_res_blocks + 1):
+ ich = input_block_chans.pop()
+ output_layers: list[ResBlock | MultiviewTransformer | Upsample] = [
+ ResBlock(
+ channels=ch + ich,
+ emb_channels=time_embed_dim,
+ out_channels=params.model_channels * mult,
+ dense_in_channels=params.dense_in_channels,
+ dropout=params.dropout,
+ )
+ ]
+ ch = params.model_channels * mult
+ if ds in params.attention_resolutions:
+ num_heads = ch // params.num_head_channels
+ dim_head = params.num_head_channels
+
+ output_layers.append(
+ MultiviewTransformer(
+ ch,
+ num_heads,
+ dim_head,
+ name=f"output_ds{ds}",
+ depth=params.transformer_depth[level],
+ context_dim=params.context_dim,
+ unflatten_names=params.unflatten_names,
+ )
+ )
+ if level and i == params.num_res_blocks:
+ out_ch = ch
+ ds //= 2
+ output_layers.append(Upsample(ch, out_ch))
+ self.output_blocks.append(TimestepEmbedSequential(*output_layers))
+ self._feature_size += ch
+
+ self.out = nn.Sequential(
+ GroupNorm32(32, ch),
+ nn.SiLU(),
+ nn.Conv2d(self.model_channels, params.out_channels, 3, padding=1),
+ )
+
+ def forward(
+ self,
+ x: torch.Tensor,
+ t: torch.Tensor,
+ y: torch.Tensor,
+ dense_y: torch.Tensor,
+ num_frames: int | None = None,
+ ) -> torch.Tensor:
+ num_frames = num_frames or self.params.num_frames
+ t_emb = timestep_embedding(t, self.model_channels)
+ t_emb = self.time_embed(t_emb)
+
+ hs = []
+ h = x
+ for module in self.input_blocks:
+ h = module(
+ h,
+ emb=t_emb,
+ context=y,
+ dense_emb=dense_y,
+ num_frames=num_frames,
+ )
+ hs.append(h)
+ h = self.middle_block(
+ h,
+ emb=t_emb,
+ context=y,
+ dense_emb=dense_y,
+ num_frames=num_frames,
+ )
+ for module in self.output_blocks:
+ h = torch.cat([h, hs.pop()], dim=1)
+ h = module(
+ h,
+ emb=t_emb,
+ context=y,
+ dense_emb=dense_y,
+ num_frames=num_frames,
+ )
+ h = h.type(x.dtype)
+ return self.out(h)
+
+
+class SGMWrapper(nn.Module):
+ def __init__(self, module: Seva):
+ super().__init__()
+ self.module = module
+
+ def forward(
+ self, x: torch.Tensor, t: torch.Tensor, c: dict, **kwargs
+ ) -> torch.Tensor:
+ x = torch.cat((x, c.get("concat", torch.Tensor([]).type_as(x))), dim=1)
+ return self.module(
+ x,
+ t=t,
+ y=c["crossattn"],
+ dense_y=c["dense_vector"],
+ **kwargs,
+ )
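Review note: the decoder in `Seva.__init__` pops one entry from `input_block_chans` per output block, so the encoder must push exactly `len(channel_mult) * (num_res_blocks + 1)` entries. The sketch below replays only that bookkeeping with the `SevaParams` defaults; `encoder_skip_channels` is a hypothetical name and no layers are built.

```python
def encoder_skip_channels(
    model_channels: int = 320,
    channel_mult: tuple[int, ...] = (1, 2, 4, 4),
    num_res_blocks: int = 2,
) -> tuple[list[int], int]:
    """Return (channel count of every encoder skip tensor, final downsample factor)."""
    chans = [model_channels]  # output of the stem convolution
    ch, ds = model_channels, 1
    for level, mult in enumerate(channel_mult):
        for _ in range(num_res_blocks):
            ch = mult * model_channels
            chans.append(ch)  # one skip per ResBlock (plus optional attention)
        if level != len(channel_mult) - 1:
            ds *= 2
            chans.append(ch)  # Downsample keeps the channel count
    return chans, ds


skip_chans, final_ds = encoder_skip_channels()
# The decoder consumes one skip per block: 4 levels * (2 + 1) blocks.
num_decoder_blocks = 4 * (2 + 1)
```

With the defaults the final downsample factor is 8, which is why the middle transformer is named `middle_ds8`, the first entry of the default `unflatten_names`.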
diff --git a/seva/modules/__init__.py b/seva/modules/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/seva/modules/autoencoder.py b/seva/modules/autoencoder.py
new file mode 100644
index 0000000000000000000000000000000000000000..e2ce7b7d76b926c43c80e207cd2279aeef12050c
--- /dev/null
+++ b/seva/modules/autoencoder.py
@@ -0,0 +1,51 @@
+import torch
+from diffusers.models import AutoencoderKL # type: ignore
+from torch import nn
+
+
+class AutoEncoder(nn.Module):
+ scale_factor: float = 0.18215
+ downsample: int = 8
+
+ def __init__(self, chunk_size: int | None = None):
+ super().__init__()
+ self.module = AutoencoderKL.from_pretrained(
+ "stabilityai/stable-diffusion-2-1-base",
+ subfolder="vae",
+ force_download=False,
+ low_cpu_mem_usage=False,
+ )
+ self.module.eval().requires_grad_(False) # type: ignore
+ self.chunk_size = chunk_size
+
+ def _encode(self, x: torch.Tensor) -> torch.Tensor:
+ return (
+ self.module.encode(x).latent_dist.mean # type: ignore
+ * self.scale_factor
+ )
+
+ def encode(self, x: torch.Tensor, chunk_size: int | None = None) -> torch.Tensor:
+ chunk_size = chunk_size or self.chunk_size
+ if chunk_size is not None:
+ return torch.cat(
+ [self._encode(x_chunk) for x_chunk in x.split(chunk_size)],
+ dim=0,
+ )
+ else:
+ return self._encode(x)
+
+ def _decode(self, z: torch.Tensor) -> torch.Tensor:
+ return self.module.decode(z / self.scale_factor).sample # type: ignore
+
+ def decode(self, z: torch.Tensor, chunk_size: int | None = None) -> torch.Tensor:
+ chunk_size = chunk_size or self.chunk_size
+ if chunk_size is not None:
+ return torch.cat(
+ [self._decode(z_chunk) for z_chunk in z.split(chunk_size)],
+ dim=0,
+ )
+ else:
+ return self._decode(z)
+
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
+ return self.decode(self.encode(x))
diff --git a/seva/modules/conditioner.py b/seva/modules/conditioner.py
new file mode 100644
index 0000000000000000000000000000000000000000..31915d778c2ca0b118ba424bcb201fe35bf15e09
--- /dev/null
+++ b/seva/modules/conditioner.py
@@ -0,0 +1,39 @@
+import kornia
+import open_clip
+import torch
+from torch import nn
+
+
+class CLIPConditioner(nn.Module):
+ mean: torch.Tensor
+ std: torch.Tensor
+
+ def __init__(self):
+ super().__init__()
+ self.module = open_clip.create_model_and_transforms(
+ "ViT-H-14", pretrained="laion2b_s32b_b79k"
+ )[0]
+ self.module.eval().requires_grad_(False) # type: ignore
+ self.register_buffer(
+ "mean", torch.Tensor([0.48145466, 0.4578275, 0.40821073]), persistent=False
+ )
+ self.register_buffer(
+ "std", torch.Tensor([0.26862954, 0.26130258, 0.27577711]), persistent=False
+ )
+
+ def preprocess(self, x: torch.Tensor) -> torch.Tensor:
+ x = kornia.geometry.resize(
+ x,
+ (224, 224),
+ interpolation="bicubic",
+ align_corners=True,
+ antialias=True,
+ )
+ x = (x + 1.0) / 2.0
+ x = kornia.enhance.normalize(x, self.mean, self.std)
+ return x
+
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
+ x = self.preprocess(x)
+ x = self.module.encode_image(x)
+ return x
diff --git a/seva/modules/layers.py b/seva/modules/layers.py
new file mode 100644
index 0000000000000000000000000000000000000000..ab1786d2d5cf6033720f968a1f181a37eb0b1423
--- /dev/null
+++ b/seva/modules/layers.py
@@ -0,0 +1,139 @@
+import math
+
+import torch
+import torch.nn.functional as F
+from einops import repeat
+from torch import nn
+
+from .transformer import MultiviewTransformer
+
+
+def timestep_embedding(
+ timesteps: torch.Tensor,
+ dim: int,
+ max_period: int = 10000,
+ repeat_only: bool = False,
+) -> torch.Tensor:
+ if not repeat_only:
+ half = dim // 2
+ freqs = torch.exp(
+ -math.log(max_period)
+ * torch.arange(start=0, end=half, dtype=torch.float32)
+ / half
+ ).to(device=timesteps.device)
+ args = timesteps[:, None].float() * freqs[None]
+ embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
+ if dim % 2:
+ embedding = torch.cat(
+ [embedding, torch.zeros_like(embedding[:, :1])], dim=-1
+ )
+ else:
+ embedding = repeat(timesteps, "b -> b d", d=dim)
+ return embedding
+
+
+class Upsample(nn.Module):
+ def __init__(self, channels: int, out_channels: int | None = None):
+ super().__init__()
+ self.channels = channels
+ self.out_channels = out_channels or channels
+ self.conv = nn.Conv2d(self.channels, self.out_channels, 3, 1, 1)
+
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
+ assert x.shape[1] == self.channels
+ x = F.interpolate(x, scale_factor=2, mode="nearest")
+ x = self.conv(x)
+ return x
+
+
+class Downsample(nn.Module):
+ def __init__(self, channels: int, out_channels: int | None = None):
+ super().__init__()
+ self.channels = channels
+ self.out_channels = out_channels or channels
+ self.op = nn.Conv2d(self.channels, self.out_channels, 3, 2, 1)
+
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
+ assert x.shape[1] == self.channels
+ return self.op(x)
+
+
+class GroupNorm32(nn.GroupNorm):
+ def forward(self, input: torch.Tensor) -> torch.Tensor:
+ return super().forward(input.float()).type(input.dtype)
+
+
+class TimestepEmbedSequential(nn.Sequential):
+ def forward( # type: ignore[override]
+ self,
+ x: torch.Tensor,
+ emb: torch.Tensor,
+ context: torch.Tensor,
+ dense_emb: torch.Tensor,
+ num_frames: int,
+ ) -> torch.Tensor:
+ for layer in self:
+ if isinstance(layer, MultiviewTransformer):
+ assert num_frames is not None
+ x = layer(x, context, num_frames)
+ elif isinstance(layer, ResBlock):
+ x = layer(x, emb, dense_emb)
+ else:
+ x = layer(x)
+ return x
+
+
+class ResBlock(nn.Module):
+ def __init__(
+ self,
+ channels: int,
+ emb_channels: int,
+ out_channels: int | None,
+ dense_in_channels: int,
+ dropout: float,
+ ):
+ super().__init__()
+ out_channels = out_channels or channels
+
+ self.in_layers = nn.Sequential(
+ GroupNorm32(32, channels),
+ nn.SiLU(),
+ nn.Conv2d(channels, out_channels, 3, 1, 1),
+ )
+ self.emb_layers = nn.Sequential(
+ nn.SiLU(), nn.Linear(emb_channels, out_channels)
+ )
+ self.dense_emb_layers = nn.Sequential(
+ nn.Conv2d(dense_in_channels, 2 * channels, 1, 1, 0)
+ )
+ self.out_layers = nn.Sequential(
+ GroupNorm32(32, out_channels),
+ nn.SiLU(),
+ nn.Dropout(dropout),
+ nn.Conv2d(out_channels, out_channels, 3, 1, 1),
+ )
+ if out_channels == channels:
+ self.skip_connection = nn.Identity()
+ else:
+ self.skip_connection = nn.Conv2d(channels, out_channels, 1, 1, 0)
+
+ def forward(
+ self, x: torch.Tensor, emb: torch.Tensor, dense_emb: torch.Tensor
+ ) -> torch.Tensor:
+ in_rest, in_conv = self.in_layers[:-1], self.in_layers[-1]
+ h = in_rest(x)
+ dense = self.dense_emb_layers(
+ F.interpolate(
+ dense_emb, size=h.shape[2:], mode="bilinear", align_corners=True
+ )
+ ).type(h.dtype)
+ dense_scale, dense_shift = torch.chunk(dense, 2, dim=1)
+ h = h * (1 + dense_scale) + dense_shift
+ h = in_conv(h)
+ emb_out = self.emb_layers(emb).type(h.dtype)
+ while len(emb_out.shape) < len(h.shape):
+ emb_out = emb_out[..., None]
+ h = h + emb_out
+ h = self.out_layers(h)
+ h = self.skip_connection(x) + h
+ return h
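Review note: `timestep_embedding` at the top of `seva/modules/layers.py` is the usual sinusoidal embedding, with cosines in the first half, sines in the second half, and a zero pad when `dim` is odd. Below is a scalar, pure-Python sketch of the non-`repeat_only` branch, assuming a single timestep instead of a batch; `sinusoidal_embedding` is a hypothetical name.

```python
import math


def sinusoidal_embedding(
    t: float, dim: int, max_period: float = 10000.0
) -> list[float]:
    """Scalar version of the cos/sin timestep embedding."""
    half = dim // 2
    # Frequencies decay geometrically from 1 towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    emb = [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]
    if dim % 2:
        emb.append(0.0)  # pad odd dimensions with a single zero
    return emb


emb_zero = sinusoidal_embedding(0.0, 4)
emb_odd = sinusoidal_embedding(3.0, 5)
```

In `Seva.forward` the embedding has `model_channels` entries and is then widened to `4 * model_channels` by `time_embed`.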
diff --git a/seva/modules/preprocessor.py b/seva/modules/preprocessor.py
new file mode 100644
index 0000000000000000000000000000000000000000..c5794463b3bb6892d5311b94c296bb83ea5245bf
--- /dev/null
+++ b/seva/modules/preprocessor.py
@@ -0,0 +1,116 @@
+import contextlib
+import os
+import os.path as osp
+import sys
+from typing import cast
+
+import imageio.v3 as iio
+import numpy as np
+import torch
+
+
+class Dust3rPipeline(object):
+ def __init__(self, device: str | torch.device = "cuda"):
+ submodule_path = osp.realpath(
+ osp.join(osp.dirname(__file__), "../../third_party/dust3r/")
+ )
+ if submodule_path not in sys.path:
+ sys.path.insert(0, submodule_path)
+ try:
+ with open(os.devnull, "w") as f, contextlib.redirect_stdout(f):
+ from dust3r.cloud_opt import ( # type: ignore[import]
+ GlobalAlignerMode,
+ global_aligner,
+ )
+ from dust3r.image_pairs import make_pairs # type: ignore[import]
+ from dust3r.inference import inference # type: ignore[import]
+ from dust3r.model import AsymmetricCroCo3DStereo # type: ignore[import]
+ from dust3r.utils.image import load_images # type: ignore[import]
+ except ImportError as e:
+ raise ImportError(
+ "Missing required submodule: 'dust3r'. Please ensure that all submodules are properly set up.\n\n"
+ "To initialize them, run the following command in the project root:\n"
+ " git submodule update --init --recursive"
+ ) from e
+
+ self.device = torch.device(device)
+ self.model = AsymmetricCroCo3DStereo.from_pretrained(
+ "naver/DUSt3R_ViTLarge_BaseDecoder_512_dpt"
+ ).to(self.device)
+
+ self._GlobalAlignerMode = GlobalAlignerMode
+ self._global_aligner = global_aligner
+ self._make_pairs = make_pairs
+ self._inference = inference
+ self._load_images = load_images
+
+ def infer_cameras_and_points(
+ self,
+ img_paths: list[str],
+ Ks: list[list] | None = None,
+ c2ws: list[list] | None = None,
+ batch_size: int = 16,
+ schedule: str = "cosine",
+ lr: float = 0.01,
+ niter: int = 500,
+ min_conf_thr: int = 3,
+ ) -> tuple[
+ list[np.ndarray], np.ndarray, np.ndarray, list[np.ndarray], list[np.ndarray]
+ ]:
+ num_img = len(img_paths)
+ if num_img == 1:
+ print("Only one image found, duplicating it to create a stereo pair.")
+ img_paths = img_paths * 2
+
+ images = self._load_images(img_paths, size=512)
+ pairs = self._make_pairs(
+ images,
+ scene_graph="complete",
+ prefilter=None,
+ symmetrize=True,
+ )
+ output = self._inference(pairs, self.model, self.device, batch_size=batch_size)
+
+ ori_imgs = [iio.imread(p) for p in img_paths]
+ ori_img_whs = np.array([img.shape[1::-1] for img in ori_imgs])
+ img_whs = np.concatenate([image["true_shape"][:, ::-1] for image in images], 0)
+
+ scene = self._global_aligner(
+ output,
+ device=self.device,
+ mode=self._GlobalAlignerMode.PointCloudOptimizer,
+ same_focals=True,
+ optimize_pp=False, # True,
+ min_conf_thr=min_conf_thr,
+ )
+
+ # if Ks is not None:
+ # scene.preset_focal(
+ # torch.tensor([[K[0, 0], K[1, 1]] for K in Ks])
+ # )
+
+ if c2ws is not None:
+ scene.preset_pose(c2ws)
+
+ _ = scene.compute_global_alignment(
+ init="msp", niter=niter, schedule=schedule, lr=lr
+ )
+
+ imgs = cast(list, scene.imgs)
+ Ks = scene.get_intrinsics().detach().cpu().numpy().copy()
+ c2ws = scene.get_im_poses().detach().cpu().numpy() # type: ignore
+ pts3d = [x.detach().cpu().numpy() for x in scene.get_pts3d()] # type: ignore
+ if num_img > 1:
+ masks = [x.detach().cpu().numpy() for x in scene.get_masks()]
+ points = [p[m] for p, m in zip(pts3d, masks)]
+ point_colors = [img[m] for img, m in zip(imgs, masks)]
+ else:
+ points = [p.reshape(-1, 3) for p in pts3d]
+ point_colors = [img.reshape(-1, 3) for img in imgs]
+
+ # Convert back to the original image size.
+ imgs = ori_imgs
+ Ks[:, :2, -1] *= ori_img_whs / img_whs
+ Ks[:, :2, :2] *= (ori_img_whs / img_whs).mean(axis=1, keepdims=True)[..., None]
+
+ return imgs, Ks, c2ws, points, point_colors
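Review note: the last two statements before the `return` map the intrinsics from DUSt3R's working resolution back to the original image size. The principal point is scaled per axis, while the upper-left 2x2 block (focal lengths and skew) is scaled by the mean of the two axis ratios. A single-camera sketch in plain Python follows; `rescale_intrinsics` and the example sizes are assumptions for illustration.

```python
def rescale_intrinsics(
    K: list[list[float]],
    src_wh: tuple[int, int],
    dst_wh: tuple[int, int],
) -> list[list[float]]:
    """Map a 3x3 pinhole matrix from `src_wh` to `dst_wh` resolution."""
    sx = dst_wh[0] / src_wh[0]
    sy = dst_wh[1] / src_wh[1]
    s = 0.5 * (sx + sy)  # mean ratio, applied to the 2x2 focal block
    return [
        [K[0][0] * s, K[0][1] * s, K[0][2] * sx],
        [K[1][0] * s, K[1][1] * s, K[1][2] * sy],
        [K[2][0], K[2][1], K[2][2]],
    ]


K_small = [[100.0, 0.0, 256.0], [0.0, 100.0, 192.0], [0.0, 0.0, 1.0]]
K_full = rescale_intrinsics(K_small, src_wh=(512, 384), dst_wh=(1024, 768))
```

When both axes scale by the same ratio, as in this example, the mean equals that ratio; when they differ, the mean is only an approximation of the focal scaling.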
diff --git a/seva/modules/transformer.py b/seva/modules/transformer.py
new file mode 100644
index 0000000000000000000000000000000000000000..0941cd1e057ff341f19a5b76f784386b123edd23
--- /dev/null
+++ b/seva/modules/transformer.py
@@ -0,0 +1,247 @@
+import torch
+import torch.nn.functional as F
+from einops import rearrange, repeat
+from torch import nn
+from torch.nn.attention import SDPBackend, sdpa_kernel
+
+
+class GEGLU(nn.Module):
+ def __init__(self, dim_in: int, dim_out: int):
+ super().__init__()
+ self.proj = nn.Linear(dim_in, dim_out * 2)
+
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
+ x, gate = self.proj(x).chunk(2, dim=-1)
+ return x * F.gelu(gate)
+
+
+class FeedForward(nn.Module):
+ def __init__(
+ self,
+ dim: int,
+ dim_out: int | None = None,
+ mult: int = 4,
+ dropout: float = 0.0,
+ ):
+ super().__init__()
+ inner_dim = int(dim * mult)
+ dim_out = dim_out or dim
+ self.net = nn.Sequential(
+ GEGLU(dim, inner_dim), nn.Dropout(dropout), nn.Linear(inner_dim, dim_out)
+ )
+
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
+ return self.net(x)
+
+
+class Attention(nn.Module):
+ def __init__(
+ self,
+ query_dim: int,
+ context_dim: int | None = None,
+ heads: int = 8,
+ dim_head: int = 64,
+ dropout: float = 0.0,
+ ):
+ super().__init__()
+ self.heads = heads
+ self.dim_head = dim_head
+ inner_dim = dim_head * heads
+ context_dim = context_dim or query_dim
+
+ self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
+ self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
+ self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
+ self.to_out = nn.Sequential(
+ nn.Linear(inner_dim, query_dim), nn.Dropout(dropout)
+ )
+
+ def forward(
+ self, x: torch.Tensor, context: torch.Tensor | None = None
+ ) -> torch.Tensor:
+ q = self.to_q(x)
+ context = context if context is not None else x
+ k = self.to_k(context)
+ v = self.to_v(context)
+ q, k, v = map(
+ lambda t: rearrange(t, "b l (h d) -> b h l d", h=self.heads),
+ (q, k, v),
+ )
+ with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
+ out = F.scaled_dot_product_attention(q, k, v)
+ out = rearrange(out, "b h l d -> b l (h d)")
+ out = self.to_out(out)
+ return out
+
+
+class TransformerBlock(nn.Module):
+ def __init__(
+ self,
+ dim: int,
+ n_heads: int,
+ d_head: int,
+ context_dim: int,
+ dropout: float = 0.0,
+ ):
+ super().__init__()
+ self.attn1 = Attention(
+ query_dim=dim,
+ context_dim=None,
+ heads=n_heads,
+ dim_head=d_head,
+ dropout=dropout,
+ )
+ self.ff = FeedForward(dim, dropout=dropout)
+ self.attn2 = Attention(
+ query_dim=dim,
+ context_dim=context_dim,
+ heads=n_heads,
+ dim_head=d_head,
+ dropout=dropout,
+ )
+ self.norm1 = nn.LayerNorm(dim)
+ self.norm2 = nn.LayerNorm(dim)
+ self.norm3 = nn.LayerNorm(dim)
+
+ def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
+ x = self.attn1(self.norm1(x)) + x
+ x = self.attn2(self.norm2(x), context=context) + x
+ x = self.ff(self.norm3(x)) + x
+ return x
+
+
+class TransformerBlockTimeMix(nn.Module):
+ def __init__(
+ self,
+ dim: int,
+ n_heads: int,
+ d_head: int,
+ context_dim: int,
+ dropout: float = 0.0,
+ ):
+ super().__init__()
+ inner_dim = n_heads * d_head
+ self.norm_in = nn.LayerNorm(dim)
+ self.ff_in = FeedForward(dim, dim_out=inner_dim, dropout=dropout)
+ self.attn1 = Attention(
+ query_dim=inner_dim,
+ context_dim=None,
+ heads=n_heads,
+ dim_head=d_head,
+ dropout=dropout,
+ )
+ self.ff = FeedForward(inner_dim, dim_out=dim, dropout=dropout)
+ self.attn2 = Attention(
+ query_dim=inner_dim,
+ context_dim=context_dim,
+ heads=n_heads,
+ dim_head=d_head,
+ dropout=dropout,
+ )
+ self.norm1 = nn.LayerNorm(inner_dim)
+ self.norm2 = nn.LayerNorm(inner_dim)
+ self.norm3 = nn.LayerNorm(inner_dim)
+
+ def forward(
+ self, x: torch.Tensor, context: torch.Tensor, num_frames: int
+ ) -> torch.Tensor:
+ _, s, _ = x.shape
+ x = rearrange(x, "(b t) s c -> (b s) t c", t=num_frames)
+ x = self.ff_in(self.norm_in(x)) + x
+ x = self.attn1(self.norm1(x), context=None) + x
+ x = self.attn2(self.norm2(x), context=context) + x
+ x = self.ff(self.norm3(x))
+ x = rearrange(x, "(b s) t c -> (b t) s c", s=s)
+ return x
+
+
+class SkipConnect(nn.Module):
+ def __init__(self):
+ super().__init__()
+
+ def forward(
+ self, x_spatial: torch.Tensor, x_temporal: torch.Tensor
+ ) -> torch.Tensor:
+ return x_spatial + x_temporal
+
+
+class MultiviewTransformer(nn.Module):
+ def __init__(
+ self,
+ in_channels: int,
+ n_heads: int,
+ d_head: int,
+ name: str,
+ unflatten_names: list[str] | None = None,
+ depth: int = 1,
+ context_dim: int = 1024,
+ dropout: float = 0.0,
+ ):
+ super().__init__()
+ self.in_channels = in_channels
+ self.name = name
+ self.unflatten_names = unflatten_names or []
+
+ inner_dim = n_heads * d_head
+ self.norm = nn.GroupNorm(32, in_channels, eps=1e-6)
+ self.proj_in = nn.Linear(in_channels, inner_dim)
+ self.transformer_blocks = nn.ModuleList(
+ [
+ TransformerBlock(
+ inner_dim,
+ n_heads,
+ d_head,
+ context_dim=context_dim,
+ dropout=dropout,
+ )
+ for _ in range(depth)
+ ]
+ )
+ self.proj_out = nn.Linear(inner_dim, in_channels)
+ self.time_mixer = SkipConnect()
+ self.time_mix_blocks = nn.ModuleList(
+ [
+ TransformerBlockTimeMix(
+ inner_dim,
+ n_heads,
+ d_head,
+ context_dim=context_dim,
+ dropout=dropout,
+ )
+ for _ in range(depth)
+ ]
+ )
+
+ def forward(
+ self, x: torch.Tensor, context: torch.Tensor, num_frames: int
+ ) -> torch.Tensor:
+ assert context.ndim == 3
+ _, _, h, w = x.shape
+ x_in = x
+
+ time_context = context
+ time_context_first_timestep = time_context[::num_frames]
+ time_context = repeat(
+ time_context_first_timestep, "b ... -> (b n) ...", n=h * w
+ )
+
+ if self.name in self.unflatten_names:
+ context = context[::num_frames]
+
+ x = self.norm(x)
+ x = rearrange(x, "b c h w -> b (h w) c")
+ x = self.proj_in(x)
+
+ for block, mix_block in zip(self.transformer_blocks, self.time_mix_blocks):
+ if self.name in self.unflatten_names:
+ x = rearrange(x, "(b t) (h w) c -> b (t h w) c", t=num_frames, h=h, w=w)
+ x = block(x, context=context)
+ if self.name in self.unflatten_names:
+ x = rearrange(x, "b (t h w) c -> (b t) (h w) c", t=num_frames, h=h, w=w)
+ x_mix = mix_block(x, context=time_context, num_frames=num_frames)
+ x = self.time_mixer(x_spatial=x, x_temporal=x_mix)
+
+ x = self.proj_out(x)
+ x = rearrange(x, "b (h w) c -> b c h w", h=h, w=w)
+ out = x + x_in
+ return out
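Review note: `TransformerBlockTimeMix.forward` regroups tokens with `"(b t) s c -> (b s) t c"`, so attention runs across the frames for each spatial location rather than across locations within a frame. The sketch below reproduces that index mapping on nested lists, with one string standing in for each channel vector; `frames_to_time_major` is a hypothetical name and the real code uses `einops.rearrange` on tensors.

```python
def frames_to_time_major(x: list[list[str]], num_frames: int) -> list[list[str]]:
    """Regroup a [(b t), s] nested list into [(b s), t]."""
    assert len(x) % num_frames == 0
    batch = len(x) // num_frames
    seq = len(x[0])
    return [
        # Row (b, s) collects the token at location s from every frame t.
        [x[b * num_frames + t][s] for t in range(num_frames)]
        for b in range(batch)
        for s in range(seq)
    ]


# One batch element, three frames, two spatial tokens per frame.
tokens = [["f0s0", "f0s1"], ["f1s0", "f1s1"], ["f2s0", "f2s1"]]
time_major = frames_to_time_major(tokens, num_frames=3)
```

This layout is also why `time_context` in `MultiviewTransformer.forward` is repeated `h * w` times: after regrouping, the leading dimension is batch times spatial positions.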
diff --git a/seva/sampling.py b/seva/sampling.py
new file mode 100644
index 0000000000000000000000000000000000000000..576226ee3ee1fef3451c430e4ed302a4f1a46a50
--- /dev/null
+++ b/seva/sampling.py
@@ -0,0 +1,405 @@
+import numpy as np
+import torch
+import torch.nn as nn
+from einops import rearrange
+from tqdm import tqdm
+
+from seva.geometry import get_camera_dist
+
+
+def append_dims(x: torch.Tensor, target_dims: int) -> torch.Tensor:
+ """Appends dimensions to the end of a tensor until it has target_dims dimensions."""
+ dims_to_append = target_dims - x.ndim
+ if dims_to_append < 0:
+ raise ValueError(
+ f"input has {x.ndim} dims but target_dims is {target_dims}, which is less"
+ )
+ return x[(...,) + (None,) * dims_to_append]
+
+
+def append_zero(x: torch.Tensor) -> torch.Tensor:
+ return torch.cat([x, x.new_zeros([1])])
+
+
+def to_d(x: torch.Tensor, sigma: torch.Tensor, denoised: torch.Tensor) -> torch.Tensor:
+ return (x - denoised) / append_dims(sigma, x.ndim)
+
+
+def make_betas(
+ num_timesteps: int, linear_start: float = 1e-4, linear_end: float = 2e-2
+) -> np.ndarray:
+ betas = (
+ torch.linspace(
+ linear_start**0.5, linear_end**0.5, num_timesteps, dtype=torch.float64
+ )
+ ** 2
+ )
+ return betas.numpy()
+
+
+def generate_roughly_equally_spaced_steps(
+ num_substeps: int, max_step: int
+) -> np.ndarray:
+ return np.linspace(max_step - 1, 0, num_substeps, endpoint=False).astype(int)[::-1]
+
+
+class EpsScaling(object):
+ def __call__(
+ self, sigma: torch.Tensor
+ ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
+ c_skip = torch.ones_like(sigma, device=sigma.device)
+ c_out = -sigma
+ c_in = 1 / (sigma**2 + 1.0) ** 0.5
+ c_noise = sigma.clone()
+ return c_skip, c_out, c_in, c_noise
+
+
+class DDPMDiscretization(object):
+ def __init__(
+ self,
+ linear_start: float = 5e-06,
+ linear_end: float = 0.012,
+ num_timesteps: int = 1000,
+ log_snr_shift: float | None = 2.4,
+ ):
+ self.num_timesteps = num_timesteps
+
+ betas = make_betas(
+ num_timesteps,
+ linear_start=linear_start,
+ linear_end=linear_end,
+ )
+ self.log_snr_shift = log_snr_shift
+
+ alphas = 1.0 - betas # first alpha here is on data side
+ self.alphas_cumprod = np.cumprod(alphas, axis=0)
+
+ def get_sigmas(self, n: int, device: str | torch.device = "cpu") -> torch.Tensor:
+ if n < self.num_timesteps:
+ timesteps = generate_roughly_equally_spaced_steps(n, self.num_timesteps)
+ alphas_cumprod = self.alphas_cumprod[timesteps]
+ elif n == self.num_timesteps:
+ alphas_cumprod = self.alphas_cumprod
+ else:
+ raise ValueError(f"Expected n <= {self.num_timesteps}, but got n = {n}.")
+
+ sigmas = ((1 - alphas_cumprod) / alphas_cumprod) ** 0.5
+ if self.log_snr_shift is not None:
+ sigmas = sigmas * np.exp(self.log_snr_shift)
+ return torch.flip(
+ torch.tensor(sigmas, dtype=torch.float32, device=device), (0,)
+ )
+
+ def __call__(
+ self,
+ n: int,
+ do_append_zero: bool = True,
+ flip: bool = False,
+ device: str | torch.device = "cpu",
+ ) -> torch.Tensor:
+ sigmas = self.get_sigmas(n, device=device)
+ sigmas = append_zero(sigmas) if do_append_zero else sigmas
+ return sigmas if not flip else torch.flip(sigmas, (0,))
+
+
+class DiscreteDenoiser(object):
+ sigmas: torch.Tensor
+
+ def __init__(
+ self,
+ discretization: DDPMDiscretization,
+ num_idx: int = 1000,
+ device: str | torch.device = "cpu",
+ ):
+ self.scaling = EpsScaling()
+ self.discretization = discretization
+ self.num_idx = num_idx
+ self.device = device
+
+ self.register_sigmas()
+
+ def register_sigmas(self):
+ self.sigmas = self.discretization(
+ self.num_idx, do_append_zero=False, flip=True, device=self.device
+ )
+
+ def sigma_to_idx(self, sigma: torch.Tensor) -> torch.Tensor:
+ dists = sigma - self.sigmas[:, None]
+ return dists.abs().argmin(dim=0).view(sigma.shape)
+
+ def idx_to_sigma(self, idx: torch.Tensor | int) -> torch.Tensor:
+ return self.sigmas[idx]
+
+ def __call__(
+ self,
+ network: nn.Module,
+ input: torch.Tensor,
+ sigma: torch.Tensor,
+ cond: dict,
+ **additional_model_inputs,
+ ) -> torch.Tensor:
+ sigma = self.idx_to_sigma(self.sigma_to_idx(sigma))
+ sigma_shape = sigma.shape
+ sigma = append_dims(sigma, input.ndim)
+ c_skip, c_out, c_in, c_noise = self.scaling(sigma)
+ c_noise = self.sigma_to_idx(c_noise.reshape(sigma_shape))
+ if "replace" in cond:
+ x, mask = cond.pop("replace").split((input.shape[1], 1), dim=1)
+ input = input * (1 - mask) + x * mask
+ return (
+ network(input * c_in, c_noise, cond, **additional_model_inputs) * c_out
+ + input * c_skip
+ )
+
+
+class ConstantScaleRule(object):
+ def __call__(self, scale: float | torch.Tensor) -> float | torch.Tensor:
+ return scale
+
+
+class MultiviewScaleRule(object):
+ def __init__(self, min_scale: float = 1.0):
+ self.min_scale = min_scale
+
+ def __call__(
+ self,
+ scale: float | torch.Tensor,
+ c2w: torch.Tensor,
+ K: torch.Tensor,
+ input_frame_mask: torch.Tensor,
+ ) -> torch.Tensor:
+ c2w_input = c2w[input_frame_mask]
+ rotation_diff = get_camera_dist(c2w, c2w_input, mode="rotation").min(-1).values
+ translation_diff = (
+ get_camera_dist(c2w, c2w_input, mode="translation").min(-1).values
+ )
+ K_diff = (
+ ((K[:, None] - K[input_frame_mask][None]).flatten(-2) == 0).all(-1).any(-1)
+ )
+ close_frame = (rotation_diff < 10.0) & (translation_diff < 1e-5) & K_diff
+ if isinstance(scale, torch.Tensor):
+ scale = scale.clone()
+ scale[close_frame] = self.min_scale
+ elif isinstance(scale, float):
+ scale = torch.where(close_frame, self.min_scale, scale)
+ else:
+ raise ValueError(f"Invalid scale type {type(scale)}.")
+ return scale
+
+
+class ConstantScaleSchedule(object):
+ def __call__(
+ self, sigma: float | torch.Tensor, scale: float | torch.Tensor
+ ) -> float | torch.Tensor:
+ if isinstance(sigma, float):
+ return scale
+ elif isinstance(sigma, torch.Tensor):
+ if len(sigma.shape) == 1 and isinstance(scale, torch.Tensor):
+ sigma = append_dims(sigma, scale.ndim)
+ return scale * torch.ones_like(sigma)
+ else:
+ raise ValueError(f"Invalid sigma type {type(sigma)}.")
+
+
+class ConstantGuidance(object):
+ def __call__(
+ self,
+ uncond: torch.Tensor,
+ cond: torch.Tensor,
+ scale: float | torch.Tensor,
+ ) -> torch.Tensor:
+ if isinstance(scale, torch.Tensor) and len(scale.shape) == 1:
+ scale = append_dims(scale, cond.ndim)
+ return uncond + scale * (cond - uncond)
+
+
+class VanillaCFG(object):
+ def __init__(self):
+ self.scale_rule = ConstantScaleRule()
+ self.scale_schedule = ConstantScaleSchedule()
+ self.guidance = ConstantGuidance()
+
+ def __call__(
+ self, x: torch.Tensor, sigma: float | torch.Tensor, scale: float | torch.Tensor
+ ) -> torch.Tensor:
+ x_u, x_c = x.chunk(2)
+ scale = self.scale_rule(scale)
+ scale_value = self.scale_schedule(sigma, scale)
+ x_pred = self.guidance(x_u, x_c, scale_value)
+ return x_pred
+
+ def prepare_inputs(
+ self, x: torch.Tensor, s: torch.Tensor, c: dict, uc: dict
+ ) -> tuple[torch.Tensor, torch.Tensor, dict]:
+ c_out = dict()
+
+ for k in c:
+ if k in ["vector", "crossattn", "concat", "replace", "dense_vector"]:
+ c_out[k] = torch.cat((uc[k], c[k]), 0)
+ else:
+ assert c[k] == uc[k]
+ c_out[k] = c[k]
+ return torch.cat([x] * 2), torch.cat([s] * 2), c_out
+
+
+class MultiviewCFG(VanillaCFG):
+ def __init__(self, cfg_min: float = 1.0):
+ self.scale_min = cfg_min
+ self.scale_rule = MultiviewScaleRule(min_scale=cfg_min)
+ self.scale_schedule = ConstantScaleSchedule()
+ self.guidance = ConstantGuidance()
+
+ def __call__( # type: ignore
+ self,
+ x: torch.Tensor,
+ sigma: float | torch.Tensor,
+ scale: float | torch.Tensor,
+ c2w: torch.Tensor,
+ K: torch.Tensor,
+ input_frame_mask: torch.Tensor,
+ ) -> torch.Tensor:
+ x_u, x_c = x.chunk(2)
+ scale = self.scale_rule(scale, c2w, K, input_frame_mask)
+ scale_value = self.scale_schedule(sigma, scale)
+ x_pred = self.guidance(x_u, x_c, scale_value)
+ return x_pred
+
+
+class MultiviewTemporalCFG(MultiviewCFG):
+ def __init__(self, num_frames: int, cfg_min: float = 1.0):
+ super().__init__(cfg_min=cfg_min)
+
+ self.num_frames = num_frames
+ distance_matrix = (
+ torch.arange(num_frames)[None] - torch.arange(num_frames)[:, None]
+ ).abs()
+ self.distance_matrix = distance_matrix
+
+ def __call__(
+ self,
+ x: torch.Tensor,
+ sigma: float | torch.Tensor,
+ scale: float | torch.Tensor,
+ c2w: torch.Tensor,
+ K: torch.Tensor,
+ input_frame_mask: torch.Tensor,
+ ) -> torch.Tensor:
+ input_frame_mask = rearrange(
+ input_frame_mask, "(b t) ... -> b t ...", t=self.num_frames
+ )
+ min_distance = (
+ self.distance_matrix[None].to(x.device)
+ + (~input_frame_mask[:, None]) * self.num_frames
+ ).min(-1)[0]
+ min_distance = min_distance / min_distance.max(-1, keepdim=True)[0].clamp(min=1)
+ scale = min_distance * (scale - self.scale_min) + self.scale_min
+ scale = rearrange(scale, "b t ... -> (b t) ...")
+ scale = append_dims(scale, x.ndim)
+ return super().__call__(x, sigma, scale, c2w, K, input_frame_mask.flatten(0, 1))
+
+
+class EulerEDMSampler(object):
+ def __init__(
+ self,
+ discretization: DDPMDiscretization,
+ guider: VanillaCFG | MultiviewCFG | MultiviewTemporalCFG,
+ num_steps: int | None = None,
+ verbose: bool = False,
+ device: str | torch.device = "cuda",
+ s_churn=0.0,
+ s_tmin=0.0,
+ s_tmax=float("inf"),
+ s_noise=1.0,
+ ):
+ self.num_steps = num_steps
+ self.discretization = discretization
+ self.guider = guider
+ self.verbose = verbose
+ self.device = device
+
+ self.s_churn = s_churn
+ self.s_tmin = s_tmin
+ self.s_tmax = s_tmax
+ self.s_noise = s_noise
+
+ def prepare_sampling_loop(
+ self, x: torch.Tensor, cond: dict, uc: dict, num_steps: int | None = None
+ ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, int, dict, dict]:
+ num_steps = num_steps or self.num_steps
+ assert num_steps is not None, "num_steps must be specified"
+ sigmas = self.discretization(num_steps, device=self.device)
+ x *= torch.sqrt(1.0 + sigmas[0] ** 2.0)
+ num_sigmas = len(sigmas)
+ s_in = x.new_ones([x.shape[0]])
+ return x, s_in, sigmas, num_sigmas, cond, uc
+
+ def get_sigma_gen(self, num_sigmas: int, verbose: bool = True) -> range | tqdm:
+ sigma_generator = range(num_sigmas - 1)
+ if self.verbose and verbose:
+ sigma_generator = tqdm(
+ sigma_generator,
+ total=num_sigmas - 1,
+ desc="Sampling",
+ leave=False,
+ )
+ return sigma_generator
+
+ def sampler_step(
+ self,
+ sigma: torch.Tensor,
+ next_sigma: torch.Tensor,
+ denoiser,
+ x: torch.Tensor,
+ scale: float | torch.Tensor,
+ cond: dict,
+ uc: dict,
+ gamma: float = 0.0,
+ **guider_kwargs,
+ ) -> torch.Tensor:
+ sigma_hat = sigma * (gamma + 1.0) + 1e-6
+
+ eps = torch.randn_like(x) * self.s_noise
+ x = x + eps * append_dims(sigma_hat**2 - sigma**2, x.ndim) ** 0.5
+
+ denoised = denoiser(*self.guider.prepare_inputs(x, sigma_hat, cond, uc))
+ denoised = self.guider(denoised, sigma_hat, scale, **guider_kwargs)
+ d = to_d(x, sigma_hat, denoised)
+ dt = append_dims(next_sigma - sigma_hat, x.ndim)
+ return x + dt * d
+
+ def __call__(
+ self,
+ denoiser,
+ x: torch.Tensor,
+ scale: float | torch.Tensor,
+ cond: dict,
+ uc: dict | None = None,
+ num_steps: int | None = None,
+ verbose: bool = True,
+ **guider_kwargs,
+ ) -> torch.Tensor:
+ uc = cond if uc is None else uc
+ x, s_in, sigmas, num_sigmas, cond, uc = self.prepare_sampling_loop(
+ x,
+ cond,
+ uc,
+ num_steps,
+ )
+ for i in self.get_sigma_gen(num_sigmas, verbose=verbose):
+ gamma = (
+ min(self.s_churn / (num_sigmas - 1), 2**0.5 - 1)
+ if self.s_tmin <= sigmas[i] <= self.s_tmax
+ else 0.0
+ )
+ x = self.sampler_step(
+ s_in * sigmas[i],
+ s_in * sigmas[i + 1],
+ denoiser,
+ x,
+ scale,
+ cond,
+ uc,
+ gamma,
+ **guider_kwargs,
+ )
+ return x
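
Reviewer note on the sampling code above: the following is a minimal scalar sketch of how the noise schedule, classifier-free guidance, and the Euler step fit together. It is not the patch's implementation. It uses plain Python floats instead of tensors, covers only the `n < num_timesteps` branch, omits churn (`gamma = 0`) and the `1e-6` offset, and assumes `to_d(x, sigma, denoised)` is `(x - denoised) / sigma`, which is not shown in this part of the diff. The names `make_sigmas`, `spaced_steps`, `sampling_sigmas`, `cfg`, and `euler_sample` are hypothetical.

```python
import math


def make_sigmas(num_timesteps=1000, linear_start=5e-06, linear_end=0.012,
                log_snr_shift=2.4):
    """Ascending per-timestep sigmas from a scaled-linear beta schedule."""
    lo, hi = linear_start ** 0.5, linear_end ** 0.5
    sigmas, alpha_bar = [], 1.0
    for i in range(num_timesteps):
        # Betas are linear in sqrt(beta), then squared (as in make_betas).
        beta = (lo + (hi - lo) * i / (num_timesteps - 1)) ** 2
        alpha_bar *= 1.0 - beta  # running product, i.e. alphas_cumprod
        sigma = math.sqrt((1.0 - alpha_bar) / alpha_bar)
        sigmas.append(sigma * math.exp(log_snr_shift))
    return sigmas


def spaced_steps(num_substeps, max_step):
    """Pure-Python mirror of generate_roughly_equally_spaced_steps."""
    start = max_step - 1
    step = -start / num_substeps
    return [int(start + i * step) for i in range(num_substeps)][::-1]


def sampling_sigmas(num_steps, all_sigmas):
    """Descending sigmas for num_steps Euler steps, with a trailing zero."""
    picked = [all_sigmas[t] for t in spaced_steps(num_steps, len(all_sigmas))]
    return picked[::-1] + [0.0]


def cfg(uncond, cond, scale):
    """Classifier-free guidance: extrapolate from uncond towards cond."""
    return uncond + scale * (cond - uncond)


def euler_sample(x, sigmas, denoise_uncond, denoise_cond, scale):
    """Deterministic Euler steps along the descending sigma sequence."""
    for sigma, next_sigma in zip(sigmas[:-1], sigmas[1:]):
        denoised = cfg(denoise_uncond(x, sigma), denoise_cond(x, sigma), scale)
        d = (x - denoised) / sigma  # assumed form of to_d
        x = x + (next_sigma - sigma) * d
    return x


steps = spaced_steps(4, 1000)
sigmas = sampling_sigmas(4, make_sigmas())
# Toy denoisers: the unconditional branch always predicts 0.0, the
# conditional branch always predicts 1.0.
x_init = 0.3 * math.sqrt(1.0 + sigmas[0] ** 2)
x_final = euler_sample(x_init, sigmas, lambda x, s: 0.0, lambda x, s: 1.0,
                       scale=2.0)
print(steps, x_final)
```

With constant toy denoisers the guided prediction is always `0.0 + 2.0 * (1.0 - 0.0) = 2.0`, and the final Euler step to `sigma = 0` lands on that value up to floating-point rounding, regardless of the starting point. The per-frame scale rules in `MultiviewCFG` and `MultiviewTemporalCFG` only change which `scale` each frame receives before this same extrapolation.
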
diff --git a/seva/utils.py b/seva/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..7934ec233f066f8a849703ae66936649081f9faf
--- /dev/null
+++ b/seva/utils.py
@@ -0,0 +1,56 @@
+import os
+
+import safetensors.torch
+import torch
+from huggingface_hub import hf_hub_download
+
+from seva.model import Seva, SevaParams
+
+
+def seed_everything(seed: int = 0):
+ torch.manual_seed(seed)
+ torch.cuda.manual_seed(seed)
+ torch.cuda.manual_seed_all(seed)
+ torch.backends.cudnn.deterministic = True
+ torch.backends.cudnn.benchmark = False
+
+
+def print_load_warning(missing: list[str], unexpected: list[str]) -> None:
+ if len(missing) > 0 and len(unexpected) > 0:
+ print(f"Got {len(missing)} missing keys:\n\t" + "\n\t".join(missing))
+ print("\n" + "-" * 79 + "\n")
+ print(f"Got {len(unexpected)} unexpected keys:\n\t" + "\n\t".join(unexpected))
+ elif len(missing) > 0:
+ print(f"Got {len(missing)} missing keys:\n\t" + "\n\t".join(missing))
+ elif len(unexpected) > 0:
+ print(f"Got {len(unexpected)} unexpected keys:\n\t" + "\n\t".join(unexpected))
+
+
+def load_model(
+ pretrained_model_name_or_path: str = "stabilityai/stable-virtual-camera",
+ weight_name: str = "model.safetensors",
+ device: str | torch.device = "cuda",
+ verbose: bool = False,
+) -> Seva:
+ if os.path.isdir(pretrained_model_name_or_path):
+ weight_path = os.path.join(pretrained_model_name_or_path, weight_name)
+ else:
+ weight_path = hf_hub_download(
+ repo_id=pretrained_model_name_or_path, filename=weight_name
+ )
+ _ = hf_hub_download(
+ repo_id=pretrained_model_name_or_path, filename="config.yaml"
+ )
+
+ state_dict = safetensors.torch.load_file(
+ weight_path,
+ device=str(device),
+ )
+
+ with torch.device("meta"):
+ model = Seva(SevaParams()).to(torch.bfloat16)
+
+ missing, unexpected = model.load_state_dict(state_dict, strict=False, assign=True)
+ if verbose:
+ print_load_warning(missing, unexpected)
+ return model
diff --git a/third_party/dust3r/.gitignore b/third_party/dust3r/.gitignore
new file mode 100644
index 0000000000000000000000000000000000000000..194e236cbd708160926c3513b4232285eb47b029
--- /dev/null
+++ b/third_party/dust3r/.gitignore
@@ -0,0 +1,132 @@
+data/
+checkpoints/
+
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+pip-wheel-metadata/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+.python-version
+
+# pipenv
+# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+# However, in case of collaboration, if having platform-specific dependencies or dependencies
+# having no cross-platform support, pipenv may install dependencies that don't work, or not
+# install all needed dependencies.
+#Pipfile.lock
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
diff --git a/third_party/dust3r/.gitmodules b/third_party/dust3r/.gitmodules
new file mode 100644
index 0000000000000000000000000000000000000000..c950ef981a8d2e47599dd7acbbe1bf8de9a42aca
--- /dev/null
+++ b/third_party/dust3r/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "croco"]
+ path = croco
+ url = https://github.com/naver/croco
diff --git a/third_party/dust3r/LICENSE b/third_party/dust3r/LICENSE
new file mode 100644
index 0000000000000000000000000000000000000000..a97986e3a8ddd49973959f6c748dfa8b881b64d3
--- /dev/null
+++ b/third_party/dust3r/LICENSE
@@ -0,0 +1,7 @@
+DUSt3R, Copyright (c) 2024-present Naver Corporation, is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license.
+
+A summary of the CC BY-NC-SA 4.0 license is located here:
+ https://creativecommons.org/licenses/by-nc-sa/4.0/
+
+The CC BY-NC-SA 4.0 license is located here:
+ https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode
diff --git a/third_party/dust3r/NOTICE b/third_party/dust3r/NOTICE
new file mode 100644
index 0000000000000000000000000000000000000000..81da544dd534c5465361f35cf6a5a0cfff7c1d3f
--- /dev/null
+++ b/third_party/dust3r/NOTICE
@@ -0,0 +1,12 @@
+DUSt3R
+Copyright 2024-present NAVER Corp.
+
+This project contains subcomponents with separate copyright notices and license terms.
+Your use of the source code for these subcomponents is subject to the terms and conditions of the following licenses.
+
+====
+
+naver/croco
+https://github.com/naver/croco/
+
+Creative Commons Attribution-NonCommercial-ShareAlike 4.0
diff --git a/third_party/dust3r/README.md b/third_party/dust3r/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..e7c7a4f9328a62e55a93f757fc41dcbca18ef546
--- /dev/null
+++ b/third_party/dust3r/README.md
@@ -0,0 +1,390 @@
+
+
+Official implementation of `DUSt3R: Geometric 3D Vision Made Easy`
+[[Project page](https://dust3r.europe.naverlabs.com/)], [[DUSt3R arxiv](https://arxiv.org/abs/2312.14132)]
+
+> **Make sure to also check [MASt3R](https://github.com/naver/mast3r): Our new model with a local feature head, metric pointmaps, and a more scalable global alignment!**
+
+
+
+
+
+```bibtex
+@inproceedings{dust3r_cvpr24,
+ title={DUSt3R: Geometric 3D Vision Made Easy},
+ author={Shuzhe Wang and Vincent Leroy and Yohann Cabon and Boris Chidlovskii and Jerome Revaud},
+ booktitle = {CVPR},
+ year = {2024}
+}
+
+@misc{dust3r_arxiv23,
+ title={DUSt3R: Geometric 3D Vision Made Easy},
+ author={Shuzhe Wang and Vincent Leroy and Yohann Cabon and Boris Chidlovskii and Jerome Revaud},
+ year={2023},
+ eprint={2312.14132},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
+
+## Table of Contents
+
+- [Table of Contents](#table-of-contents)
+- [License](#license)
+- [Get Started](#get-started)
+ - [Installation](#installation)
+ - [Checkpoints](#checkpoints)
+ - [Interactive demo](#interactive-demo)
+ - [Interactive demo with docker](#interactive-demo-with-docker)
+- [Usage](#usage)
+- [Training](#training)
+ - [Datasets](#datasets)
+ - [Demo](#demo)
+ - [Our Hyperparameters](#our-hyperparameters)
+
+## License
+
+The code is distributed under the CC BY-NC-SA 4.0 License.
+See [LICENSE](LICENSE) for more information.
+
+```python
+# Copyright (C) 2024-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+```
+
+## Get Started
+
+### Installation
+
+1. Clone DUSt3R.
+```bash
+git clone --recursive https://github.com/naver/dust3r
+cd dust3r
+# if you have already cloned dust3r:
+# git submodule update --init --recursive
+```
+
+2. Create the environment. Here we show an example using conda.
+```bash
+conda create -n dust3r python=3.11 cmake=3.14.0
+conda activate dust3r
+conda install pytorch torchvision pytorch-cuda=12.1 -c pytorch -c nvidia # use the correct version of cuda for your system
+pip install -r requirements.txt
+# Optional: you can also install additional packages to:
+# - add support for HEIC images
+# - add pyrender, used to render depthmap in some datasets preprocessing
+# - add required packages for visloc.py
+pip install -r requirements_optional.txt
+```
+
+3. Optionally, compile the CUDA kernels for RoPE (as in CroCo v2).
+```bash
+# DUSt3R relies on RoPE positional embeddings, for which you can compile some CUDA kernels for faster runtime.
+cd croco/models/curope/
+python setup.py build_ext --inplace
+cd ../../../
+```
+
+### Checkpoints
+
+You can obtain the checkpoints in two ways:
+
+1) You can use our huggingface_hub integration: the models will be downloaded automatically.
+
+2) Otherwise, we provide several pre-trained models:
+
+| Modelname | Training resolutions | Head | Encoder | Decoder |
+|-------------|----------------------|------|---------|---------|
+| [`DUSt3R_ViTLarge_BaseDecoder_224_linear.pth`](https://download.europe.naverlabs.com/ComputerVision/DUSt3R/DUSt3R_ViTLarge_BaseDecoder_224_linear.pth) | 224x224 | Linear | ViT-L | ViT-B |
+| [`DUSt3R_ViTLarge_BaseDecoder_512_linear.pth`](https://download.europe.naverlabs.com/ComputerVision/DUSt3R/DUSt3R_ViTLarge_BaseDecoder_512_linear.pth) | 512x384, 512x336, 512x288, 512x256, 512x160 | Linear | ViT-L | ViT-B |
+| [`DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth`](https://download.europe.naverlabs.com/ComputerVision/DUSt3R/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth) | 512x384, 512x336, 512x288, 512x256, 512x160 | DPT | ViT-L | ViT-B |
+
+You can check the hyperparameters we used to train these models in the [section: Our Hyperparameters](#our-hyperparameters).
+
+To download a specific model, for example `DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth`:
+```bash
+mkdir -p checkpoints/
+wget https://download.europe.naverlabs.com/ComputerVision/DUSt3R/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth -P checkpoints/
+```
+
+For the checkpoints, make sure to agree to the license of all the public training datasets and base checkpoints we used, in addition to CC-BY-NC-SA 4.0. Again, see [section: Our Hyperparameters](#our-hyperparameters) for details.
+
+### Interactive demo
+
+In this demo, you should be able to run DUSt3R on your machine to reconstruct a scene.
+First, select images that depict the same scene.
+
+You can adjust the global alignment schedule and its number of iterations.
+
+> [!NOTE]
+> If you selected one or two images, the global alignment procedure will be skipped (mode=GlobalAlignerMode.PairViewer)
+
+Hit "Run" and wait.
+When the global alignment ends, the reconstruction appears.
+Use the slider "min_conf_thr" to show or remove low confidence areas.
+
+```bash
+python3 demo.py --model_name DUSt3R_ViTLarge_BaseDecoder_512_dpt
+
+# Use --weights to load a checkpoint from a local file, e.g., --weights checkpoints/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth
+# Use --image_size to select the correct resolution for the selected checkpoint. 512 (default) or 224
+# Use --local_network to make it accessible on the local network, or --server_name to specify the url manually
+# Use --server_port to change the port, by default it will search for an available port starting at 7860
+# Use --device to use a different device, by default it's "cuda"
+```
+
+### Interactive demo with docker
+
+To run DUSt3R using Docker, including with NVIDIA CUDA support, follow these instructions:
+
+1. **Install Docker**: If not already installed, download and install `docker` and `docker compose` from the [Docker website](https://www.docker.com/get-started).
+
+2. **Install NVIDIA Docker Toolkit**: For GPU support, install the NVIDIA Docker toolkit from the [Nvidia website](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).
+
+3. **Build the Docker image and run it**: `cd` into the `./docker` directory and run the following commands:
+
+```bash
+cd docker
+bash run.sh --with-cuda --model_name="DUSt3R_ViTLarge_BaseDecoder_512_dpt"
+```
+
+Or if you want to run the demo without CUDA support, run the following command:
+
+```bash
+cd docker
+bash run.sh --model_name="DUSt3R_ViTLarge_BaseDecoder_512_dpt"
+```
+
+By default, `demo.py` is launched with the option `--local_network`.
+Visit `http://localhost:7860/` to access the web UI (or replace `localhost` with the machine's name to access it from the network).
+
+`run.sh` will launch docker-compose using either the [docker-compose-cuda.yml](docker/docker-compose-cuda.yml) or [docker-compose-cpu.yml](docker/docker-compose-cpu.yml) config file, then it starts the demo using [entrypoint.sh](docker/files/entrypoint.sh).
+
+
+
+
+## Usage
+
+```python
+from dust3r.inference import inference
+from dust3r.model import AsymmetricCroCo3DStereo
+from dust3r.utils.image import load_images
+from dust3r.image_pairs import make_pairs
+from dust3r.cloud_opt import global_aligner, GlobalAlignerMode
+
+if __name__ == '__main__':
+ device = 'cuda'
+ batch_size = 1
+ schedule = 'cosine'
+ lr = 0.01
+ niter = 300
+
+ model_name = "naver/DUSt3R_ViTLarge_BaseDecoder_512_dpt"
+ # you can put the path to a local checkpoint in model_name if needed
+ model = AsymmetricCroCo3DStereo.from_pretrained(model_name).to(device)
+ # load_images can take a list of images or a directory
+ images = load_images(['croco/assets/Chateau1.png', 'croco/assets/Chateau2.png'], size=512)
+ pairs = make_pairs(images, scene_graph='complete', prefilter=None, symmetrize=True)
+ output = inference(pairs, model, device, batch_size=batch_size)
+
+ # at this stage, you have the raw dust3r predictions
+ view1, pred1 = output['view1'], output['pred1']
+ view2, pred2 = output['view2'], output['pred2']
+ # here, view1, pred1, view2, pred2 are dicts of lists of len(2)
+ # -> because we symmetrize we have (im1, im2) and (im2, im1) pairs
+ # in each view you have:
+ # an integer image identifier: view1['idx'] and view2['idx']
+ # the img: view1['img'] and view2['img']
+ # the image shape: view1['true_shape'] and view2['true_shape']
+ # an instance string output by the dataloader: view1['instance'] and view2['instance']
+ # pred1 and pred2 contain the confidence values: pred1['conf'] and pred2['conf']
+ # pred1 contains 3D points for view1['img'] in view1['img'] space: pred1['pts3d']
+ # pred2 contains 3D points for view2['img'] in view1['img'] space: pred2['pts3d_in_other_view']
+
+ # next we'll use the global_aligner to align the predictions
+ # depending on your task, you may be fine with the raw output and not need it
+ # with only two input images, you could use GlobalAlignerMode.PairViewer: it would just convert the output
+ # if using GlobalAlignerMode.PairViewer, no need to run compute_global_alignment
+ scene = global_aligner(output, device=device, mode=GlobalAlignerMode.PointCloudOptimizer)
+ loss = scene.compute_global_alignment(init="mst", niter=niter, schedule=schedule, lr=lr)
+
+ # retrieve useful values from scene:
+ imgs = scene.imgs
+ focals = scene.get_focals()
+ poses = scene.get_im_poses()
+ pts3d = scene.get_pts3d()
+ confidence_masks = scene.get_masks()
+
+ # visualize reconstruction
+ scene.show()
+
+ # find 2D-2D matches between the two images
+ from dust3r.utils.geometry import find_reciprocal_matches, xy_grid
+ pts2d_list, pts3d_list = [], []
+ for i in range(2):
+ conf_i = confidence_masks[i].cpu().numpy()
+ pts2d_list.append(xy_grid(*imgs[i].shape[:2][::-1])[conf_i]) # imgs[i].shape[:2] = (H, W)
+ pts3d_list.append(pts3d[i].detach().cpu().numpy()[conf_i])
+ reciprocal_in_P2, nn2_in_P1, num_matches = find_reciprocal_matches(*pts3d_list)
+ print(f'found {num_matches} matches')
+ matches_im1 = pts2d_list[1][reciprocal_in_P2]
+ matches_im0 = pts2d_list[0][nn2_in_P1][reciprocal_in_P2]
+
+ # visualize a few matches
+ import numpy as np
+ from matplotlib import pyplot as pl
+ n_viz = 10
+ match_idx_to_viz = np.round(np.linspace(0, num_matches-1, n_viz)).astype(int)
+ viz_matches_im0, viz_matches_im1 = matches_im0[match_idx_to_viz], matches_im1[match_idx_to_viz]
+
+ H0, W0, H1, W1 = *imgs[0].shape[:2], *imgs[1].shape[:2]
+ img0 = np.pad(imgs[0], ((0, max(H1 - H0, 0)), (0, 0), (0, 0)), 'constant', constant_values=0)
+ img1 = np.pad(imgs[1], ((0, max(H0 - H1, 0)), (0, 0), (0, 0)), 'constant', constant_values=0)
+ img = np.concatenate((img0, img1), axis=1)
+ pl.figure()
+ pl.imshow(img)
+ cmap = pl.get_cmap('jet')
+ for i in range(n_viz):
+ (x0, y0), (x1, y1) = viz_matches_im0[i].T, viz_matches_im1[i].T
+ pl.plot([x0, x1 + W0], [y0, y1], '-+', color=cmap(i / (n_viz - 1)), scalex=False, scaley=False)
+ pl.show(block=True)
+
+```
+
+
+## Training
+
+In this section, we present a short demonstration to get started with training DUSt3R.
+
+### Datasets
+At this moment, we have added the following training datasets:
+ - [CO3Dv2](https://github.com/facebookresearch/co3d) - [Creative Commons Attribution-NonCommercial 4.0 International](https://github.com/facebookresearch/co3d/blob/main/LICENSE)
+ - [ARKitScenes](https://github.com/apple/ARKitScenes) - [Creative Commons Attribution-NonCommercial-ShareAlike 4.0](https://github.com/apple/ARKitScenes/tree/main?tab=readme-ov-file#license)
+ - [ScanNet++](https://kaldir.vc.in.tum.de/scannetpp/) - [non-commercial research and educational purposes](https://kaldir.vc.in.tum.de/scannetpp/static/scannetpp-terms-of-use.pdf)
+ - [BlendedMVS](https://github.com/YoYo000/BlendedMVS) - [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/)
+ - [Waymo Open Dataset](https://github.com/waymo-research/waymo-open-dataset) - [Non-Commercial Use](https://waymo.com/open/terms/)
+ - [Habitat-Sim](https://github.com/facebookresearch/habitat-sim/blob/main/DATASETS.md)
+ - [MegaDepth](https://www.cs.cornell.edu/projects/megadepth/)
+ - [StaticThings3D](https://github.com/lmb-freiburg/robustmvd/blob/master/rmvd/data/README.md#staticthings3d)
+ - [WildRGB-D](https://github.com/wildrgbd/wildrgbd/)
+
+For each dataset, we provide a preprocessing script in the `datasets_preprocess` directory and an archive containing the list of pairs when needed.
+You have to download the datasets yourself from their official sources, agree to their license, download our list of pairs, and run the preprocessing script.
+
+Links:
+
+[ARKitScenes pairs](https://download.europe.naverlabs.com/ComputerVision/DUSt3R/arkitscenes_pairs.zip)
+[ScanNet++ pairs](https://download.europe.naverlabs.com/ComputerVision/DUSt3R/scannetpp_pairs.zip)
+[BlendedMVS pairs](https://download.europe.naverlabs.com/ComputerVision/DUSt3R/blendedmvs_pairs.npy)
+[WayMo Open dataset pairs](https://download.europe.naverlabs.com/ComputerVision/DUSt3R/waymo_pairs.npz)
+[Habitat metadata](https://download.europe.naverlabs.com/ComputerVision/DUSt3R/habitat_5views_v1_512x512_metadata.tar.gz)
+[MegaDepth pairs](https://download.europe.naverlabs.com/ComputerVision/DUSt3R/megadepth_pairs.npz)
+[StaticThings3D pairs](https://download.europe.naverlabs.com/ComputerVision/DUSt3R/staticthings_pairs.npy)
+
+> [!NOTE]
+> They are not strictly equivalent to what was used to train DUSt3R, but they should be close enough.
+
+### Demo
+For this training demo, we're going to download and prepare a subset of [CO3Dv2](https://github.com/facebookresearch/co3d) - [Creative Commons Attribution-NonCommercial 4.0 International](https://github.com/facebookresearch/co3d/blob/main/LICENSE) and launch the training code on it.
+The demo model will be trained for a few epochs on a very small dataset.
+It will not be very good.
+
+```bash
+# download and prepare the co3d subset
+mkdir -p data/co3d_subset
+cd data/co3d_subset
+git clone https://github.com/facebookresearch/co3d
+cd co3d
+python3 ./co3d/download_dataset.py --download_folder ../ --single_sequence_subset
+rm ../*.zip
+cd ../../..
+
+python3 datasets_preprocess/preprocess_co3d.py --co3d_dir data/co3d_subset --output_dir data/co3d_subset_processed --single_sequence_subset
+
+# download the pretrained croco v2 checkpoint
+mkdir -p checkpoints/
+wget https://download.europe.naverlabs.com/ComputerVision/CroCo/CroCo_V2_ViTLarge_BaseDecoder.pth -P checkpoints/
+
+# the training of dust3r is done in 3 steps.
+# for this example we'll do fewer epochs, for the actual hyperparameters we used in the paper, see the next section: "Our Hyperparameters"
+# step 1 - train dust3r for 224 resolution
+torchrun --nproc_per_node=4 train.py \
+ --train_dataset "1000 @ Co3d(split='train', ROOT='data/co3d_subset_processed', aug_crop=16, mask_bg='rand', resolution=224, transform=ColorJitter)" \
+ --test_dataset "100 @ Co3d(split='test', ROOT='data/co3d_subset_processed', resolution=224, seed=777)" \
+ --model "AsymmetricCroCo3DStereo(pos_embed='RoPE100', img_size=(224, 224), head_type='linear', output_mode='pts3d', depth_mode=('exp', -inf, inf), conf_mode=('exp', 1, inf), enc_embed_dim=1024, enc_depth=24, enc_num_heads=16, dec_embed_dim=768, dec_depth=12, dec_num_heads=12)" \
+ --train_criterion "ConfLoss(Regr3D(L21, norm_mode='avg_dis'), alpha=0.2)" \
+ --test_criterion "Regr3D_ScaleShiftInv(L21, gt_scale=True)" \
+ --pretrained "checkpoints/CroCo_V2_ViTLarge_BaseDecoder.pth" \
+ --lr 0.0001 --min_lr 1e-06 --warmup_epochs 1 --epochs 10 --batch_size 16 --accum_iter 1 \
+ --save_freq 1 --keep_freq 5 --eval_freq 1 \
+ --output_dir "checkpoints/dust3r_demo_224"
+
+# step 2 - train dust3r for 512 resolution
+torchrun --nproc_per_node=4 train.py \
+ --train_dataset "1000 @ Co3d(split='train', ROOT='data/co3d_subset_processed', aug_crop=16, mask_bg='rand', resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter)" \
+ --test_dataset "100 @ Co3d(split='test', ROOT='data/co3d_subset_processed', resolution=(512,384), seed=777)" \
+ --model "AsymmetricCroCo3DStereo(pos_embed='RoPE100', patch_embed_cls='ManyAR_PatchEmbed', img_size=(512, 512), head_type='linear', output_mode='pts3d', depth_mode=('exp', -inf, inf), conf_mode=('exp', 1, inf), enc_embed_dim=1024, enc_depth=24, enc_num_heads=16, dec_embed_dim=768, dec_depth=12, dec_num_heads=12)" \
+ --train_criterion "ConfLoss(Regr3D(L21, norm_mode='avg_dis'), alpha=0.2)" \
+ --test_criterion "Regr3D_ScaleShiftInv(L21, gt_scale=True)" \
+ --pretrained "checkpoints/dust3r_demo_224/checkpoint-best.pth" \
+ --lr 0.0001 --min_lr 1e-06 --warmup_epochs 1 --epochs 10 --batch_size 4 --accum_iter 4 \
+ --save_freq 1 --keep_freq 5 --eval_freq 1 \
+ --output_dir "checkpoints/dust3r_demo_512"
+
+# step 3 - train dust3r for 512 resolution with dpt
+torchrun --nproc_per_node=4 train.py \
+ --train_dataset "1000 @ Co3d(split='train', ROOT='data/co3d_subset_processed', aug_crop=16, mask_bg='rand', resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter)" \
+ --test_dataset "100 @ Co3d(split='test', ROOT='data/co3d_subset_processed', resolution=(512,384), seed=777)" \
+ --model "AsymmetricCroCo3DStereo(pos_embed='RoPE100', patch_embed_cls='ManyAR_PatchEmbed', img_size=(512, 512), head_type='dpt', output_mode='pts3d', depth_mode=('exp', -inf, inf), conf_mode=('exp', 1, inf), enc_embed_dim=1024, enc_depth=24, enc_num_heads=16, dec_embed_dim=768, dec_depth=12, dec_num_heads=12)" \
+ --train_criterion "ConfLoss(Regr3D(L21, norm_mode='avg_dis'), alpha=0.2)" \
+ --test_criterion "Regr3D_ScaleShiftInv(L21, gt_scale=True)" \
+ --pretrained "checkpoints/dust3r_demo_512/checkpoint-best.pth" \
+ --lr 0.0001 --min_lr 1e-06 --warmup_epochs 1 --epochs 10 --batch_size 2 --accum_iter 8 \
+ --save_freq 1 --keep_freq 5 --eval_freq 1 --disable_cudnn_benchmark \
+ --output_dir "checkpoints/dust3r_demo_512dpt"
+
+```
+
+### Our Hyperparameters
+
+Here are the commands we used for training the models:
+
+```bash
+# NOTE: ROOT path omitted for datasets
+# 224 linear
+torchrun --nproc_per_node 8 train.py \
+ --train_dataset=" + 100_000 @ Habitat(1_000_000, split='train', aug_crop=16, resolution=224, transform=ColorJitter) + 100_000 @ BlendedMVS(split='train', aug_crop=16, resolution=224, transform=ColorJitter) + 100_000 @ MegaDepth(split='train', aug_crop=16, resolution=224, transform=ColorJitter) + 100_000 @ ARKitScenes(aug_crop=256, resolution=224, transform=ColorJitter) + 100_000 @ Co3d(split='train', aug_crop=16, mask_bg='rand', resolution=224, transform=ColorJitter) + 100_000 @ StaticThings3D(aug_crop=256, mask_bg='rand', resolution=224, transform=ColorJitter) + 100_000 @ ScanNetpp(split='train', aug_crop=256, resolution=224, transform=ColorJitter) + 100_000 @ InternalUnreleasedDataset(aug_crop=128, resolution=224, transform=ColorJitter) " \
+ --test_dataset=" Habitat(1_000, split='val', resolution=224, seed=777) + 1_000 @ BlendedMVS(split='val', resolution=224, seed=777) + 1_000 @ MegaDepth(split='val', resolution=224, seed=777) + 1_000 @ Co3d(split='test', mask_bg='rand', resolution=224, seed=777) " \
+ --train_criterion="ConfLoss(Regr3D(L21, norm_mode='avg_dis'), alpha=0.2)" \
+ --test_criterion="Regr3D_ScaleShiftInv(L21, gt_scale=True)" \
+ --model="AsymmetricCroCo3DStereo(pos_embed='RoPE100', img_size=(224, 224), head_type='linear', output_mode='pts3d', depth_mode=('exp', -inf, inf), conf_mode=('exp', 1, inf), enc_embed_dim=1024, enc_depth=24, enc_num_heads=16, dec_embed_dim=768, dec_depth=12, dec_num_heads=12)" \
+ --pretrained="checkpoints/CroCo_V2_ViTLarge_BaseDecoder.pth" \
+ --lr=0.0001 --min_lr=1e-06 --warmup_epochs=10 --epochs=100 --batch_size=16 --accum_iter=1 \
+ --save_freq=5 --keep_freq=10 --eval_freq=1 \
+ --output_dir="checkpoints/dust3r_224"
+
+# 512 linear
+torchrun --nproc_per_node 8 train.py \
+ --train_dataset=" + 10_000 @ Habitat(1_000_000, split='train', aug_crop=16, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ BlendedMVS(split='train', aug_crop=16, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ MegaDepth(split='train', aug_crop=16, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ ARKitScenes(aug_crop=256, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ Co3d(split='train', aug_crop=16, mask_bg='rand', resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ StaticThings3D(aug_crop=256, mask_bg='rand', resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ ScanNetpp(split='train', aug_crop=256, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ InternalUnreleasedDataset(aug_crop=128, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) " \
+ --test_dataset=" Habitat(1_000, split='val', resolution=(512,384), seed=777) + 1_000 @ BlendedMVS(split='val', resolution=(512,384), seed=777) + 1_000 @ MegaDepth(split='val', resolution=(512,336), seed=777) + 1_000 @ Co3d(split='test', resolution=(512,384), seed=777) " \
+ --train_criterion="ConfLoss(Regr3D(L21, norm_mode='avg_dis'), alpha=0.2)" \
+ --test_criterion="Regr3D_ScaleShiftInv(L21, gt_scale=True)" \
+ --model="AsymmetricCroCo3DStereo(pos_embed='RoPE100', patch_embed_cls='ManyAR_PatchEmbed', img_size=(512, 512), head_type='linear', output_mode='pts3d', depth_mode=('exp', -inf, inf), conf_mode=('exp', 1, inf), enc_embed_dim=1024, enc_depth=24, enc_num_heads=16, dec_embed_dim=768, dec_depth=12, dec_num_heads=12)" \
+ --pretrained="checkpoints/dust3r_224/checkpoint-best.pth" \
+ --lr=0.0001 --min_lr=1e-06 --warmup_epochs=20 --epochs=100 --batch_size=4 --accum_iter=2 \
+ --save_freq=10 --keep_freq=10 --eval_freq=1 --print_freq=10 \
+ --output_dir="checkpoints/dust3r_512"
+
+# 512 dpt
+torchrun --nproc_per_node 8 train.py \
+ --train_dataset=" + 10_000 @ Habitat(1_000_000, split='train', aug_crop=16, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ BlendedMVS(split='train', aug_crop=16, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ MegaDepth(split='train', aug_crop=16, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ ARKitScenes(aug_crop=256, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ Co3d(split='train', aug_crop=16, mask_bg='rand', resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ StaticThings3D(aug_crop=256, mask_bg='rand', resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ ScanNetpp(split='train', aug_crop=256, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ InternalUnreleasedDataset(aug_crop=128, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) " \
+ --test_dataset=" Habitat(1_000, split='val', resolution=(512,384), seed=777) + 1_000 @ BlendedMVS(split='val', resolution=(512,384), seed=777) + 1_000 @ MegaDepth(split='val', resolution=(512,336), seed=777) + 1_000 @ Co3d(split='test', resolution=(512,384), seed=777) " \
+ --train_criterion="ConfLoss(Regr3D(L21, norm_mode='avg_dis'), alpha=0.2)" \
+ --test_criterion="Regr3D_ScaleShiftInv(L21, gt_scale=True)" \
+ --model="AsymmetricCroCo3DStereo(pos_embed='RoPE100', patch_embed_cls='ManyAR_PatchEmbed', img_size=(512, 512), head_type='dpt', output_mode='pts3d', depth_mode=('exp', -inf, inf), conf_mode=('exp', 1, inf), enc_embed_dim=1024, enc_depth=24, enc_num_heads=16, dec_embed_dim=768, dec_depth=12, dec_num_heads=12)" \
+ --pretrained="checkpoints/dust3r_512/checkpoint-best.pth" \
+ --lr=0.0001 --min_lr=1e-06 --warmup_epochs=15 --epochs=90 --batch_size=4 --accum_iter=2 \
+ --save_freq=5 --keep_freq=10 --eval_freq=1 --print_freq=10 --disable_cudnn_benchmark \
+ --output_dir="checkpoints/dust3r_512dpt"
+
+```
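As a sanity check on the settings above, the sketch below computes the effective batch size of each run. It assumes the usual distributed data-parallel convention that the effective batch size is the number of processes times the per-GPU `--batch_size` times `--accum_iter`; the helper `effective_batch_size` and the run labels are illustrative names, not part of the repository.

```python
def effective_batch_size(nproc_per_node: int, batch_size: int, accum_iter: int) -> int:
    """Samples contributing to one optimizer step (assumed DDP convention)."""
    return nproc_per_node * batch_size * accum_iter


# (nproc_per_node, batch_size, accum_iter) copied from the commands above.
runs = {
    "demo 512 dpt": (4, 2, 8),
    "224 linear": (8, 16, 1),
    "512 linear": (8, 4, 2),
    "512 dpt": (8, 4, 2),
}

sizes = {name: effective_batch_size(*cfg) for name, cfg in runs.items()}
for name, size in sizes.items():
    print(f"{name}: {size}")
```

Under this assumption, the 512-resolution runs reach the same effective batch size with a smaller per-GPU batch by accumulating gradients over more iterations.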
diff --git a/third_party/dust3r/assets/demo.jpg b/third_party/dust3r/assets/demo.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..c815d468d83a7e91a0ccc24a2f491b10178e955f
--- /dev/null
+++ b/third_party/dust3r/assets/demo.jpg
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:957a892f9033fb3e733546a202e3c07e362618c708eacf050979d4c4edd5435f
+size 339600
diff --git a/third_party/dust3r/assets/dust3r.jpg b/third_party/dust3r/assets/dust3r.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..2f65b18fdb613950a683186b2b0fbcbbbcad82e4
Binary files /dev/null and b/third_party/dust3r/assets/dust3r.jpg differ
diff --git a/third_party/dust3r/assets/dust3r_archi.jpg b/third_party/dust3r/assets/dust3r_archi.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..332de7f7dfd78ef70b9cf3defcebafec1e1a8d6e
Binary files /dev/null and b/third_party/dust3r/assets/dust3r_archi.jpg differ
diff --git a/third_party/dust3r/assets/matching.jpg b/third_party/dust3r/assets/matching.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..636e69c70921c7dac3872fedaee4d508af7ba4db
--- /dev/null
+++ b/third_party/dust3r/assets/matching.jpg
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ecfe07fd00505045a155902c5686cc23060782a8b020f7596829fb60584a79ee
+size 159312
diff --git a/third_party/dust3r/assets/pipeline1.jpg b/third_party/dust3r/assets/pipeline1.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..5a0fc1e800b92fae577d6293ad50c8ee1815c3e8
Binary files /dev/null and b/third_party/dust3r/assets/pipeline1.jpg differ
diff --git a/third_party/dust3r/croco/LICENSE b/third_party/dust3r/croco/LICENSE
new file mode 100644
index 0000000000000000000000000000000000000000..d9b84b1a65f9db6d8920a9048d162f52ba3ea56d
--- /dev/null
+++ b/third_party/dust3r/croco/LICENSE
@@ -0,0 +1,52 @@
+CroCo, Copyright (c) 2022-present Naver Corporation, is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license.
+
+A summary of the CC BY-NC-SA 4.0 license is located here:
+ https://creativecommons.org/licenses/by-nc-sa/4.0/
+
+The CC BY-NC-SA 4.0 license is located here:
+ https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode
+
+
+SEE NOTICE BELOW WITH RESPECT TO THE FILE: models/pos_embed.py, models/blocks.py
+
+***************************
+
+NOTICE WITH RESPECT TO THE FILE: models/pos_embed.py
+
+This software is being redistributed in a modifiled form. The original form is available here:
+
+https://github.com/facebookresearch/mae/blob/main/util/pos_embed.py
+
+This software in this file incorporates parts of the following software available here:
+
+Transformer: https://github.com/tensorflow/models/blob/master/official/legacy/transformer/model_utils.py
+available under the following license: https://github.com/tensorflow/models/blob/master/LICENSE
+
+MoCo v3: https://github.com/facebookresearch/moco-v3
+available under the following license: https://github.com/facebookresearch/moco-v3/blob/main/LICENSE
+
+DeiT: https://github.com/facebookresearch/deit
+available under the following license: https://github.com/facebookresearch/deit/blob/main/LICENSE
+
+
+ORIGINAL COPYRIGHT NOTICE AND PERMISSION NOTICE AVAILABLE HERE IS REPRODUCE BELOW:
+
+https://github.com/facebookresearch/mae/blob/main/LICENSE
+
+Attribution-NonCommercial 4.0 International
+
+***************************
+
+NOTICE WITH RESPECT TO THE FILE: models/blocks.py
+
+This software is being redistributed in a modifiled form. The original form is available here:
+
+https://github.com/rwightman/pytorch-image-models
+
+ORIGINAL COPYRIGHT NOTICE AND PERMISSION NOTICE AVAILABLE HERE IS REPRODUCE BELOW:
+
+https://github.com/rwightman/pytorch-image-models/blob/master/LICENSE
+
+Apache License
+Version 2.0, January 2004
+http://www.apache.org/licenses/
\ No newline at end of file
diff --git a/third_party/dust3r/croco/NOTICE b/third_party/dust3r/croco/NOTICE
new file mode 100644
index 0000000000000000000000000000000000000000..d51bb365036c12d428d6e3a4fd00885756d5261c
--- /dev/null
+++ b/third_party/dust3r/croco/NOTICE
@@ -0,0 +1,21 @@
+CroCo
+Copyright 2022-present NAVER Corp.
+
+This project contains subcomponents with separate copyright notices and license terms.
+Your use of the source code for these subcomponents is subject to the terms and conditions of the following licenses.
+
+====
+
+facebookresearch/mae
+https://github.com/facebookresearch/mae
+
+Attribution-NonCommercial 4.0 International
+
+====
+
+rwightman/pytorch-image-models
+https://github.com/rwightman/pytorch-image-models
+
+Apache License
+Version 2.0, January 2004
+http://www.apache.org/licenses/
\ No newline at end of file
diff --git a/third_party/dust3r/croco/README.MD b/third_party/dust3r/croco/README.MD
new file mode 100644
index 0000000000000000000000000000000000000000..38e33b001a60bd16749317fb297acd60f28a6f1b
--- /dev/null
+++ b/third_party/dust3r/croco/README.MD
@@ -0,0 +1,124 @@
+# CroCo + CroCo v2 / CroCo-Stereo / CroCo-Flow
+
+[[`CroCo arXiv`](https://arxiv.org/abs/2210.10716)] [[`CroCo v2 arXiv`](https://arxiv.org/abs/2211.10408)] [[`project page and demo`](https://croco.europe.naverlabs.com/)]
+
+This repository contains the code for our CroCo model presented in our NeurIPS'22 paper [CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion](https://openreview.net/pdf?id=wZEfHUM5ri) and its follow-up extension published at ICCV'23 [Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow](https://openaccess.thecvf.com/content/ICCV2023/html/Weinzaepfel_CroCo_v2_Improved_Cross-view_Completion_Pre-training_for_Stereo_Matching_and_ICCV_2023_paper.html), referred to as CroCo v2:
+
+
+
+```bibtex
+@inproceedings{croco,
+ title={{CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion}},
+ author={Weinzaepfel, Philippe and Leroy, Vincent and Lucas, Thomas and Br\'egier, Romain and Cabon, Yohann and Arora, Vaibhav and Antsfeld, Leonid and Chidlovskii, Boris and Csurka, Gabriela and Revaud, J\'er\^ome},
+ booktitle={{NeurIPS}},
+ year={2022}
+}
+
+@inproceedings{croco_v2,
+ title={{CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow}},
+ author={Weinzaepfel, Philippe and Lucas, Thomas and Leroy, Vincent and Cabon, Yohann and Arora, Vaibhav and Br{\'e}gier, Romain and Csurka, Gabriela and Antsfeld, Leonid and Chidlovskii, Boris and Revaud, J{\'e}r{\^o}me},
+ booktitle={ICCV},
+ year={2023}
+}
+```
+
+## License
+
+The code is distributed under the CC BY-NC-SA 4.0 License. See [LICENSE](LICENSE) for more information.
+Some components are based on code from [MAE](https://github.com/facebookresearch/mae) released under the CC BY-NC 4.0 License and [timm](https://github.com/rwightman/pytorch-image-models) released under the Apache 2.0 License.
+Some components for stereo matching and optical flow are based on code from [unimatch](https://github.com/autonomousvision/unimatch) released under the MIT license.
+
+## Preparation
+
+1. Install dependencies on a machine with an NVIDIA GPU using e.g. conda. Note that `habitat-sim` is required only for the interactive demo and the synthetic pre-training data generation. If you don't plan to use it, you can skip the line that installs it and use a more recent Python version.
+
+```bash
+conda create -n croco python=3.7 cmake=3.14.0
+conda activate croco
+conda install habitat-sim headless -c conda-forge -c aihabitat
+conda install pytorch torchvision -c pytorch
+conda install notebook ipykernel matplotlib
+conda install ipywidgets widgetsnbextension
+conda install scikit-learn tqdm quaternion opencv # only for pretraining / habitat data generation
+
+```
+
+2. Compile CUDA kernels for RoPE
+
+CroCo v2 relies on RoPE positional embeddings, for which you need to compile some CUDA kernels.
+```bash
+cd models/curope/
+python setup.py build_ext --inplace
+cd ../../
+```
+
+Compilation can take a while because we build for all CUDA architectures; feel free to edit line 9 of `models/curope/setup.py` to compile for specific architectures only.
+You might also need to set the `CUDA_HOME` environment variable if you use a custom CUDA installation.
+
+If you cannot compile the kernels, we also provide a slower PyTorch version, which will be loaded automatically.
+
+3. Download pre-trained model
+
+We provide several pre-trained models:
+
+| modelname | pre-training data | pos. embed. | Encoder | Decoder |
+|------------------------------------------------------------------------------------------------------------------------------------|-------------------|-------------|---------|---------|
+| [`CroCo.pth`](https://download.europe.naverlabs.com/ComputerVision/CroCo/CroCo.pth) | Habitat | cosine | ViT-B | Small |
+| [`CroCo_V2_ViTBase_SmallDecoder.pth`](https://download.europe.naverlabs.com/ComputerVision/CroCo/CroCo_V2_ViTBase_SmallDecoder.pth) | Habitat + real | RoPE | ViT-B | Small |
+| [`CroCo_V2_ViTBase_BaseDecoder.pth`](https://download.europe.naverlabs.com/ComputerVision/CroCo/CroCo_V2_ViTBase_BaseDecoder.pth) | Habitat + real | RoPE | ViT-B | Base |
+| [`CroCo_V2_ViTLarge_BaseDecoder.pth`](https://download.europe.naverlabs.com/ComputerVision/CroCo/CroCo_V2_ViTLarge_BaseDecoder.pth) | Habitat + real | RoPE | ViT-L | Base |
+
+To download a specific model, e.g., the first one (`CroCo.pth`):
+```bash
+mkdir -p pretrained_models/
+wget https://download.europe.naverlabs.com/ComputerVision/CroCo/CroCo.pth -P pretrained_models/
+```
+
+## Reconstruction example
+
+After downloading the `CroCo_V2_ViTLarge_BaseDecoder` pretrained model (or updating the corresponding line in `demo.py`), simply run:
+```bash
+python demo.py
+```
+
+## Interactive demonstration of cross-view completion reconstruction on the Habitat simulator
+
+First download the test scene from Habitat:
+```bash
+python -m habitat_sim.utils.datasets_download --uids habitat_test_scenes --data-path habitat-sim-data/
+```
+
+Then, run the Notebook demo `interactive_demo.ipynb`.
+
+In this demo, you should be able to sample a random reference viewpoint from a [Habitat](https://github.com/facebookresearch/habitat-sim) test scene. Use the sliders to change the viewpoint and select a masked target view to reconstruct using CroCo.
+
+
+## Pre-training
+
+### CroCo
+
+To pre-train CroCo, please first generate the pre-training data from the Habitat simulator, following the instructions in [datasets/habitat_sim/README.MD](datasets/habitat_sim/README.MD) and then run the following command:
+```
+torchrun --nproc_per_node=4 pretrain.py --output_dir ./output/pretraining/
+```
+
+Our CroCo pre-training was launched on a single server with 4 GPUs.
+The 400 pre-training epochs should take around 10 days on A100 GPUs or 15 days on V100 GPUs, but decent performance is reached earlier in training.
+Note that, while the code contains the same learning-rate scaling rule as MAE for changes in the effective batch size, we did not test whether it is valid in our case.
+The first run can take a few minutes to start, as it parses all available pre-training pairs.
+
+### CroCo v2
+
+For CroCo v2 pre-training, in addition to the generation of the pre-training data from the Habitat simulator above, please pre-extract the crops from the real datasets following the instructions in [datasets/crops/README.MD](datasets/crops/README.MD).
+Then, run the following command for the largest model (ViT-L encoder, Base decoder):
+```
+torchrun --nproc_per_node=8 pretrain.py --model "CroCoNet(enc_embed_dim=1024, enc_depth=24, enc_num_heads=16, dec_embed_dim=768, dec_num_heads=12, dec_depth=12, pos_embed='RoPE100')" --dataset "habitat_release+ARKitScenes+MegaDepth+3DStreetView+IndoorVL" --warmup_epochs 12 --max_epoch 125 --epochs 250 --amp 0 --keep_freq 5 --output_dir ./output/pretraining_crocov2/
+```
+
+Our CroCo v2 pre-training was launched on a single server with 8 GPUs for the largest model, and on a single server with 4 GPUs for the smaller ones, keeping a batch size of 64 per GPU in all cases.
+The largest model should take around 12 days on A100 GPUs.
+Note that, while the code contains the same learning-rate scaling rule as MAE for changes in the effective batch size, we did not test whether it is valid in our case.
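To make the two notes above concrete, here is a minimal sketch of an MAE-style learning-rate rule. It rests on assumptions rather than on the actual `pretrain.py`: the learning rate is scaled linearly as `base_lr * effective_batch / 256`, and the schedule is a linear warm-up followed by a half-cycle cosine. The function names and the `base_lr=1.5e-4` value are placeholders chosen for illustration.

```python
import math


def scaled_lr(base_lr, batch_per_gpu, num_gpus, accum_iter=1):
    """Linear scaling rule (assumed, MAE-style): lr grows with the effective batch."""
    return base_lr * batch_per_gpu * num_gpus * accum_iter / 256


def lr_at_epoch(epoch, lr, warmup_epochs, schedule_epochs, min_lr=0.0):
    """Linear warm-up, then half-cycle cosine decay over `schedule_epochs`."""
    if epoch < warmup_epochs:
        return lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (schedule_epochs - warmup_epochs)
    return min_lr + (lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))


# 64 samples per GPU on 8 GPUs, as stated above; base_lr is a placeholder.
peak_lr = scaled_lr(base_lr=1.5e-4, batch_per_gpu=64, num_gpus=8)

# --warmup_epochs 12 --max_epoch 125 --epochs 250 from the command above.
for epoch in (0, 12, 125):
    print(epoch, lr_at_epoch(epoch, peak_lr, warmup_epochs=12, schedule_epochs=250))
```

Under these assumptions, a run that stops at epoch 125 of a 250-epoch cosine schedule ends with the learning rate still slightly above half of its peak value.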
+
+## Stereo matching and Optical flow downstream tasks
+
+For CroCo-Stereo and CroCo-Flow, please refer to [stereoflow/README.MD](stereoflow/README.MD).
diff --git a/third_party/dust3r/croco/assets/Chateau1.png b/third_party/dust3r/croco/assets/Chateau1.png
new file mode 100644
index 0000000000000000000000000000000000000000..295b00e46972ffcacaca60c2c7c7ec7a04c762fa
--- /dev/null
+++ b/third_party/dust3r/croco/assets/Chateau1.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:71ffb8c7d77e5ced0bb3dcd2cb0db84d0e98e6ff5ffd2d02696a7156e5284857
+size 112106
diff --git a/third_party/dust3r/croco/assets/Chateau2.png b/third_party/dust3r/croco/assets/Chateau2.png
new file mode 100644
index 0000000000000000000000000000000000000000..97b3c058ff180a6d0c0853ab533b0823a06f8425
--- /dev/null
+++ b/third_party/dust3r/croco/assets/Chateau2.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c3a0be9e19f6b89491d692c71e3f2317c2288a898a990561d48b7667218b47c8
+size 109905
diff --git a/third_party/dust3r/croco/assets/arch.jpg b/third_party/dust3r/croco/assets/arch.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..3f5b032729ddc58c06d890a0ebda1749276070c4
Binary files /dev/null and b/third_party/dust3r/croco/assets/arch.jpg differ
diff --git a/third_party/dust3r/croco/croco-stereo-flow-demo.ipynb b/third_party/dust3r/croco/croco-stereo-flow-demo.ipynb
new file mode 100644
index 0000000000000000000000000000000000000000..2b00a7607ab5f82d1857041969bfec977e56b3e0
--- /dev/null
+++ b/third_party/dust3r/croco/croco-stereo-flow-demo.ipynb
@@ -0,0 +1,191 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "9bca0f41",
+ "metadata": {},
+ "source": [
+ "# Simple inference example with CroCo-Stereo or CroCo-Flow"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "80653ef7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Copyright (C) 2022-present Naver Corporation. All rights reserved.\n",
+ "# Licensed under CC BY-NC-SA 4.0 (non-commercial use only)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4f033862",
+ "metadata": {},
+ "source": [
+ "First download the model(s) of your choice by running\n",
+ "```\n",
+ "bash stereoflow/download_model.sh crocostereo.pth\n",
+ "bash stereoflow/download_model.sh crocoflow.pth\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1fb2e392",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import torch\n",
+ "from PIL import Image\n",
+ "use_gpu = torch.cuda.is_available() and torch.cuda.device_count()>0\n",
+ "device = torch.device('cuda:0' if use_gpu else 'cpu')\n",
+ "import matplotlib.pylab as plt"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e0e25d77",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from stereoflow.test import _load_model_and_criterion\n",
+ "from stereoflow.engine import tiled_pred\n",
+ "from stereoflow.datasets_stereo import img_to_tensor, vis_disparity\n",
+ "from stereoflow.datasets_flow import flowToColor\n",
+ "tile_overlap=0.7 # recommended value, higher value can be slightly better but slower"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "86a921f5",
+ "metadata": {},
+ "source": [
+ "### CroCo-Stereo example"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "64e483cb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "image1 = np.asarray(Image.open(''))\n",
+ "image2 = np.asarray(Image.open(''))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f0d04303",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "model, _, cropsize, with_conf, task, tile_conf_mode = _load_model_and_criterion('stereoflow_models/crocostereo.pth', None, device)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "47dc14b5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "im1 = img_to_tensor(image1).to(device).unsqueeze(0)\n",
+ "im2 = img_to_tensor(image2).to(device).unsqueeze(0)\n",
+ "with torch.inference_mode():\n",
+ " pred, _, _ = tiled_pred(model, None, im1, im2, None, conf_mode=tile_conf_mode, overlap=tile_overlap, crop=cropsize, with_conf=with_conf, return_time=False)\n",
+ "pred = pred.squeeze(0).squeeze(0).cpu().numpy()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "583b9f16",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "plt.imshow(vis_disparity(pred))\n",
+ "plt.axis('off')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d2df5d70",
+ "metadata": {},
+ "source": [
+ "### CroCo-Flow example"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9ee257a7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "image1 = np.asarray(Image.open(''))\n",
+ "image2 = np.asarray(Image.open(''))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d5edccf0",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "model, _, cropsize, with_conf, task, tile_conf_mode = _load_model_and_criterion('stereoflow_models/crocoflow.pth', None, device)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b19692c3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "im1 = img_to_tensor(image1).to(device).unsqueeze(0)\n",
+ "im2 = img_to_tensor(image2).to(device).unsqueeze(0)\n",
+ "with torch.inference_mode():\n",
+ " pred, _, _ = tiled_pred(model, None, im1, im2, None, conf_mode=tile_conf_mode, overlap=tile_overlap, crop=cropsize, with_conf=with_conf, return_time=False)\n",
+ "pred = pred.squeeze(0).permute(1,2,0).cpu().numpy()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "26f79db3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "plt.imshow(flowToColor(pred))\n",
+ "plt.axis('off')"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.7"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/third_party/dust3r/croco/datasets/__init__.py b/third_party/dust3r/croco/datasets/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/third_party/dust3r/croco/datasets/crops/README.MD b/third_party/dust3r/croco/datasets/crops/README.MD
new file mode 100644
index 0000000000000000000000000000000000000000..47ddabebb177644694ee247ae878173a3a16644f
--- /dev/null
+++ b/third_party/dust3r/croco/datasets/crops/README.MD
@@ -0,0 +1,104 @@
+## Generation of crops from the real datasets
+
+The instructions below generate the crops used for pre-training CroCo v2 from the following real-world datasets: ARKitScenes, MegaDepth, 3DStreetView and IndoorVL.
+
+### Download the metadata of the crops to generate
+
+First, download the metadata and put them in `./data/`:
+```
+mkdir -p data
+cd data/
+wget https://download.europe.naverlabs.com/ComputerVision/CroCo/data/crop_metadata.zip
+unzip crop_metadata.zip
+rm crop_metadata.zip
+cd ..
+```
+
+### Prepare the original datasets
+
+Second, download the original datasets in `./data/original_datasets/`.
+```
+mkdir -p data/original_datasets
+```
+
+##### ARKitScenes
+
+Download the `raw` dataset from https://github.com/apple/ARKitScenes/blob/main/DATA.md and put it in `./data/original_datasets/ARKitScenes/`.
+The resulting file structure should be like:
+```
+./data/original_datasets/ARKitScenes/
+└───Training
+ └───40753679
+ │ │ ultrawide
+ │ │ ...
+ └───40753686
+ │
+ ...
+```
+
+##### MegaDepth
+
+Download `MegaDepth v1 Dataset` from https://www.cs.cornell.edu/projects/megadepth/ and put it in `./data/original_datasets/MegaDepth/`.
+The resulting file structure should be like:
+
+```
+./data/original_datasets/MegaDepth/
+└───0000
+│ └───images
+│ │ │ 1000557903_87fa96b8a4_o.jpg
+│ │ └ ...
+│ └─── ...
+└───0001
+│ │
+│ └ ...
+└─── ...
+```
+
+##### 3DStreetView
+
+Download `3D_Street_View` dataset from https://github.com/amir32002/3D_Street_View and put it in `./data/original_datasets/3DStreetView/`.
+The resulting file structure should be like:
+
+```
+./data/original_datasets/3DStreetView/
+└───dataset_aligned
+│ └───0002
+│ │ │ 0000002_0000001_0000002_0000001.jpg
+│ │ └ ...
+│ └─── ...
+└───dataset_unaligned
+│ └───0003
+│ │ │ 0000003_0000001_0000002_0000001.jpg
+│ │ └ ...
+│ └─── ...
+```
+
+##### IndoorVL
+
+Download the `IndoorVL` datasets using [Kapture](https://github.com/naver/kapture).
+
+```
+pip install kapture
+mkdir -p ./data/original_datasets/IndoorVL
+cd ./data/original_datasets/IndoorVL
+kapture_download_dataset.py update
+kapture_download_dataset.py install "HyundaiDepartmentStore_*"
+kapture_download_dataset.py install "GangnamStation_*"
+cd -
+```
+
+### Extract the crops
+
+Now, extract the crops for each of the datasets:
+```
+for dataset in ARKitScenes MegaDepth 3DStreetView IndoorVL;
+do
+ python3 datasets/crops/extract_crops_from_images.py --crops ./data/crop_metadata/${dataset}/crops_release.txt --root-dir ./data/original_datasets/${dataset}/ --output-dir ./data/${dataset}_crops/ --imsize 256 --nthread 8 --max-subdir-levels 5 --ideal-number-pairs-in-dir 500;
+done
+```
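To give a feel for how `--max-subdir-levels` and `--ideal-number-pairs-in-dir` shape the output tree, the standalone sketch below mirrors the directory-level computation and path construction in `extract_crops_from_images.py`. The function names `directory_layout` and `crop_path` and the crop count of 2,000,000 are illustrative assumptions, not values taken from the released metadata.

```python
import math


def directory_layout(num_crops, ideal_pairs_in_dir=500, max_subdir_levels=5):
    """Return (number of path levels, entries per directory) for num_crops pairs."""
    num_levels = min(
        math.ceil(math.log(num_crops, ideal_pairs_in_dir)), max_subdir_levels
    )
    pairs_in_dir = math.ceil(num_crops ** (1 / num_levels))
    return num_levels, pairs_in_dir


def crop_path(idx, num_levels, pairs_in_dir):
    """Relative path prefix of crop pair idx; each component is written in hex."""
    powers = [pairs_in_dir**level for level in reversed(range(num_levels))]
    parts, rest = [], idx
    for level in range(num_levels - 1):
        parts.append(rest // powers[level])
        rest %= powers[level]
    parts.append(idx)  # the leaf name keeps the full crop index
    return "/".join(format(p, "x") for p in parts)


levels, per_dir = directory_layout(2_000_000)  # hypothetical crop count
print(levels, per_dir)
print(crop_path(20_000, levels, per_dir))
```

Each pair is then stored as `<path>_1.jpg` and `<path>_2.jpg` under the output directory, and `<path>` is the entry written to `listing.txt`.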
+
+##### Note for IndoorVL
+
+Due to legal constraints, we can only release 144,228 of the 1,593,689 pairs used in the paper.
+To compensate in terms of the number of pre-training iterations, the pre-training command in this repository uses 125 training epochs, 12 warm-up epochs, and a cosine learning-rate schedule of length 250, instead of 100, 10, and 200 respectively.
+The impact on performance is negligible.
diff --git a/third_party/dust3r/croco/datasets/crops/extract_crops_from_images.py b/third_party/dust3r/croco/datasets/crops/extract_crops_from_images.py
new file mode 100644
index 0000000000000000000000000000000000000000..032be73899d7f72604ae0d1bc00cdb67728fd72e
--- /dev/null
+++ b/third_party/dust3r/croco/datasets/crops/extract_crops_from_images.py
@@ -0,0 +1,184 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+#
+# --------------------------------------------------------
+# Extracting crops for pre-training
+# --------------------------------------------------------
+
+import argparse
+import functools
+import math
+import os
+from multiprocessing import Pool
+
+from PIL import Image
+from tqdm import tqdm
+
+
+def arg_parser():
+    parser = argparse.ArgumentParser(
+        description="Generate cropped image pairs from image crop list"
+    )
+
+    parser.add_argument("--crops", type=str, required=True, help="crop file")
+    parser.add_argument("--root-dir", type=str, required=True, help="root directory")
+    parser.add_argument(
+        "--output-dir", type=str, required=True, help="output directory"
+    )
+    parser.add_argument("--imsize", type=int, default=256, help="size of the crops")
+    parser.add_argument(
+        "--nthread", type=int, required=True, help="number of simultaneous threads"
+    )
+    parser.add_argument(
+        "--max-subdir-levels",
+        type=int,
+        default=5,
+        help="maximum number of subdirectories",
+    )
+    parser.add_argument(
+        "--ideal-number-pairs-in-dir",
+        type=int,
+        default=500,
+        help="number of pairs stored in a dir",
+    )
+    return parser
+
+
+def main(args):
+    listing_path = os.path.join(args.output_dir, "listing.txt")
+
+    print(f"Loading list of crops ... ({args.nthread} threads)")
+    crops, num_crops_to_generate = load_crop_file(args.crops)
+
+    print(f"Preparing jobs ({len(crops)} candidate image pairs)...")
+    num_levels = min(
+        math.ceil(math.log(num_crops_to_generate, args.ideal_number_pairs_in_dir)),
+        args.max_subdir_levels,
+    )
+    num_pairs_in_dir = math.ceil(num_crops_to_generate ** (1 / num_levels))
+
+    jobs = prepare_jobs(crops, num_levels, num_pairs_in_dir)
+    del crops
+
+    os.makedirs(args.output_dir, exist_ok=True)
+    mmap = Pool(args.nthread).imap_unordered if args.nthread > 1 else map
+    call = functools.partial(save_image_crops, args)
+
+    print(f"Generating cropped images to {args.output_dir} ...")
+    with open(listing_path, "w") as listing:
+        listing.write("# pair_path\n")
+        for results in tqdm(mmap(call, jobs), total=len(jobs)):
+            for path in results:
+                listing.write(f"{path}\n")
+    print("Finished writing listing to", listing_path)
+
+
+def load_crop_file(path):
+    data = open(path).read().splitlines()
+    pairs = []
+    num_crops_to_generate = 0
+    for line in tqdm(data):
+        if line.startswith("#"):
+            continue
+        line = line.split(", ")
+        if len(line) < 8:
+            img1, img2, rotation = line
+            pairs.append((img1, img2, int(rotation), []))
+        else:
+            l1, r1, t1, b1, l2, r2, t2, b2 = map(int, line)
+            rect1, rect2 = (l1, t1, r1, b1), (l2, t2, r2, b2)
+            pairs[-1][-1].append((rect1, rect2))
+            num_crops_to_generate += 1
+    return pairs, num_crops_to_generate
+
+
+def prepare_jobs(pairs, num_levels, num_pairs_in_dir):
+    jobs = []
+    powers = [num_pairs_in_dir**level for level in reversed(range(num_levels))]
+
+    def get_path(idx):
+        idx_array = []
+        d = idx
+        for level in range(num_levels - 1):
+            idx_array.append(idx // powers[level])
+            idx = idx % powers[level]
+        idx_array.append(d)
+        return "/".join(map(lambda x: hex(x)[2:], idx_array))
+
+    idx = 0
+    for pair_data in tqdm(pairs):
+        img1, img2, rotation, crops = pair_data
+        if -60 <= rotation and rotation <= 60:
+            rotation = 0  # most likely not a true rotation
+        paths = [get_path(idx + k) for k in range(len(crops))]
+        idx += len(crops)
+        jobs.append(((img1, img2), rotation, crops, paths))
+    return jobs
+
+
+def load_image(path):
+    try:
+        return Image.open(path).convert("RGB")
+    except Exception as e:
+        print("skipping", path, e)
+        raise OSError(f"cannot load image: {path}") from e
+
+
+def save_image_crops(args, data):
+ # load images
+ img_pair, rot, crops, paths = data
+ try:
+ img1, img2 = [
+ load_image(os.path.join(args.root_dir, impath)) for impath in img_pair
+ ]
+ except OSError:
+ return []
+
+ def area(sz):
+ return sz[0] * sz[1]
+
+ tgt_size = (args.imsize, args.imsize)
+
+ def prepare_crop(img, rect, rot=0):
+ # actual crop
+ img = img.crop(rect)
+
+ # resize to desired size
+ interp = (
+ Image.Resampling.LANCZOS
+ if area(img.size) > 4 * area(tgt_size)
+ else Image.Resampling.BICUBIC
+ )
+ img = img.resize(tgt_size, resample=interp)
+
+ # rotate the image
+ rot90 = (round(rot / 90) % 4) * 90
+ if rot90 == 90:
+ img = img.transpose(Image.Transpose.ROTATE_90)
+ elif rot90 == 180:
+ img = img.transpose(Image.Transpose.ROTATE_180)
+ elif rot90 == 270:
+ img = img.transpose(Image.Transpose.ROTATE_270)
+ return img
+
+ results = []
+ for (rect1, rect2), path in zip(crops, paths):
+ crop1 = prepare_crop(img1, rect1)
+ crop2 = prepare_crop(img2, rect2, rot)
+
+ fullpath1 = os.path.join(args.output_dir, path + "_1.jpg")
+ fullpath2 = os.path.join(args.output_dir, path + "_2.jpg")
+ os.makedirs(os.path.dirname(fullpath1), exist_ok=True)
+
+ assert not os.path.isfile(fullpath1), fullpath1
+ assert not os.path.isfile(fullpath2), fullpath2
+ crop1.save(fullpath1)
+ crop2.save(fullpath2)
+ results.append(path)
+
+ return results
+
+
+if __name__ == "__main__":
+ args = arg_parser().parse_args()
+ main(args)
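
Note on the crop script above: the output directory layout follows from two numbers computed in `main` (the number of subdirectory levels and the number of pairs per directory), and `get_path` then turns a crop index into a hex-encoded relative path. The standalone sketch below mirrors that arithmetic so the layout can be checked without running the full script. The function names `shard_layout` and `crop_path` are hypothetical and not part of the patch.

```python
import math


def shard_layout(num_crops, ideal_pairs_in_dir=500, max_subdir_levels=5):
    """Return (num_levels, num_pairs_in_dir) following the formulas in main()."""
    num_levels = min(
        math.ceil(math.log(num_crops, ideal_pairs_in_dir)), max_subdir_levels
    )
    num_pairs_in_dir = math.ceil(num_crops ** (1 / num_levels))
    return num_levels, num_pairs_in_dir


def crop_path(idx, num_levels, num_pairs_in_dir):
    """Return the hex-encoded relative path of crop number idx, as in get_path()."""
    powers = [num_pairs_in_dir**level for level in reversed(range(num_levels))]
    parts, full_idx = [], idx
    for level in range(num_levels - 1):
        parts.append(idx // powers[level])
        idx %= powers[level]
    parts.append(full_idx)  # the last component is the full crop index
    return "/".join(hex(p)[2:] for p in parts)


levels, per_dir = shard_layout(200_000)
print(levels, per_dir)                   # 2 448
print(crop_path(1000, levels, per_dir))  # 2/3e8
```

With these formulas, a crop file that contains exactly one crop yields `num_levels = 0` and a division by zero in the exponent, so a guard such as `max(1, ...)` may be worth considering.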
diff --git a/third_party/dust3r/croco/datasets/habitat_sim/README.MD b/third_party/dust3r/croco/datasets/habitat_sim/README.MD
new file mode 100644
index 0000000000000000000000000000000000000000..a505781ff9eb91bce7f1d189e848f8ba1c560940
--- /dev/null
+++ b/third_party/dust3r/croco/datasets/habitat_sim/README.MD
@@ -0,0 +1,76 @@
+## Generation of synthetic image pairs using Habitat-Sim
+
+These instructions explain how to generate pre-training image pairs with the Habitat simulator.
+Because we did not save the metadata of the pairs used in the original paper, the pairs generated here are not strictly identical to them, but they use the same settings and are equivalent.
+
+### Download Habitat-Sim scenes
+Download Habitat-Sim scenes:
+- Download links can be found here: https://github.com/facebookresearch/habitat-sim/blob/main/DATASETS.md
+- We used scenes from the HM3D, habitat-test-scenes, Replica, ReplicaCad and ScanNet datasets.
+- Please put the scenes under `./data/habitat-sim-data/scene_datasets/` following the structure below, or manually update the paths in `paths.py`.
+```
+./data/
+└──habitat-sim-data/
+   └──scene_datasets/
+      ├──hm3d/
+      ├──gibson/
+      ├──habitat-test-scenes/
+      ├──replica_cad_baked_lighting/
+      ├──replica_cad/
+      ├──ReplicaDataset/
+      └──scannet/
+```
+
+### Image pairs generation
+We provide metadata to generate reproducible image pairs for pretraining and validation.
+The experiments described in the paper used similar data, but their generation was not reproducible at the time.
+
+Specifications:
+- 256x256 resolution images, with a 60-degree field of view.
+- Up to 1000 image pairs per scene.
+- Number of scenes considered / number of image pairs per dataset:
+    - ScanNet: 1097 scenes / 985,209 pairs
+    - HM3D:
+        - hm3d/train: 800 scenes / 800k pairs
+        - hm3d/val: 100 scenes / 100k pairs
+        - hm3d/minival: 10 scenes / 10k pairs
+    - habitat-test-scenes: 3 scenes / 3k pairs
+    - replica_cad_baked_lighting: 13 scenes / 13k pairs
+
+- Scenes from hm3d/val and hm3d/minival were not used for pre-training; they were kept for validation purposes.
+
+Download metadata and extract it:
+```bash
+mkdir -p data/habitat_release_metadata/
+cd data/habitat_release_metadata/
+wget https://download.europe.naverlabs.com/ComputerVision/CroCo/data/habitat_release_metadata/multiview_habitat_metadata.tar.gz
+tar -xvf multiview_habitat_metadata.tar.gz
+cd ../..
+# Location of the metadata
+METADATA_DIR="./data/habitat_release_metadata/multiview_habitat_metadata"
+```
+
+Generate image pairs from metadata:
+- The following command prints a list of command lines, one per scene, that generate the image pairs:
+```bash
+# Target output directory
+PAIRS_DATASET_DIR="./data/habitat_release/"
+python datasets/habitat_sim/generate_from_metadata_files.py --input_dir=$METADATA_DIR --output_dir=$PAIRS_DATASET_DIR
+```
+- Several of these commands can be run in parallel, e.g. using GNU Parallel:
+```bash
+python datasets/habitat_sim/generate_from_metadata_files.py --input_dir=$METADATA_DIR --output_dir=$PAIRS_DATASET_DIR | parallel -j 16
+```
+
+## Metadata generation
+
+Image pairs were randomly sampled using the following commands, whose outputs contain randomness and are thus not exactly reproducible:
+```bash
+# Print commandlines to generate image pairs from the different scenes available.
+PAIRS_DATASET_DIR=MY_CUSTOM_PATH
+python datasets/habitat_sim/generate_multiview_images.py --list_commands --output_dir=$PAIRS_DATASET_DIR
+
+# Once a dataset is generated, pack metadata files for reproducibility.
+METADATA_DIR=MY_CUSTOM_PATH
+python datasets/habitat_sim/pack_metadata_files.py $PAIRS_DATASET_DIR $METADATA_DIR
+```
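
Note on the README above: the stated setting of 256x256 images with a 60-degree field of view fixes the pinhole intrinsics of every generated view. The sketch below reproduces the formula used by `compute_camera_intrinsics` in `multiview_habitat_sim_generator.py`, using only the standard library. The function name `pinhole_intrinsics` is hypothetical.

```python
import math


def pinhole_intrinsics(height, width, hfov_deg):
    """Focal length in pixels and principal point for a given horizontal FOV."""
    f = width / 2 / math.tan(math.radians(hfov_deg) / 2)
    cu, cv = width / 2, height / 2
    return f, cu, cv


f, cu, cv = pinhole_intrinsics(256, 256, 60)
print(round(f, 2), cu, cv)  # 221.7 128.0 128.0
```

The same focal length is used for both image axes, which matches the intrinsics matrix `[[f, 0, cu], [0, f, cv], [0, 0, 1]]` written to the `*_camera_params.json` files.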
diff --git a/third_party/dust3r/croco/datasets/habitat_sim/__init__.py b/third_party/dust3r/croco/datasets/habitat_sim/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/third_party/dust3r/croco/datasets/habitat_sim/generate_from_metadata.py b/third_party/dust3r/croco/datasets/habitat_sim/generate_from_metadata.py
new file mode 100644
index 0000000000000000000000000000000000000000..91e6371b03ff919308f557babc3fc7a565510d22
--- /dev/null
+++ b/third_party/dust3r/croco/datasets/habitat_sim/generate_from_metadata.py
@@ -0,0 +1,126 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+"""
+Script to generate image pairs for a given scene reproducing poses provided in a metadata file.
+"""
+import argparse
+import json
+import os
+
+import cv2
+import PIL.Image
+import quaternion
+from datasets.habitat_sim.multiview_habitat_sim_generator import (
+ MultiviewHabitatSimGenerator,
+)
+from datasets.habitat_sim.paths import SCENES_DATASET
+from tqdm import tqdm
+
+
+def generate_multiview_images_from_metadata(
+ metadata_filename,
+ output_dir,
+ overload_params=dict(),
+ scene_datasets_paths=None,
+ exist_ok=False,
+):
+ """
+ Generate images from a metadata file for reproducibility purposes.
+ """
+ # Reorder paths by decreasing label length, to avoid collisions when testing whether a path starts with a given label
+ if scene_datasets_paths is not None:
+ scene_datasets_paths = dict(
+ sorted(scene_datasets_paths.items(), key=lambda x: len(x[0]), reverse=True)
+ )
+
+ with open(metadata_filename, "r") as f:
+ input_metadata = json.load(f)
+ metadata = dict()
+ for key, value in input_metadata.items():
+ # Optionally replace some paths
+ if key in ("scene_dataset_config_file", "scene", "navmesh") and value != "":
+ if scene_datasets_paths is not None:
+ for dataset_label, dataset_path in scene_datasets_paths.items():
+ if value.startswith(dataset_label):
+ value = os.path.normpath(
+ os.path.join(
+ dataset_path, os.path.relpath(value, dataset_label)
+ )
+ )
+ break
+ metadata[key] = value
+
+ # Overload some parameters
+ for key, value in overload_params.items():
+ metadata[key] = value
+
+ generation_entries = dict(
+ [
+ (key, value)
+ for key, value in metadata.items()
+ if key not in ("multiviews", "output_dir", "generate_depth")
+ ]
+ )
+ generate_depth = metadata["generate_depth"]
+
+ os.makedirs(output_dir, exist_ok=exist_ok)
+
+ generator = MultiviewHabitatSimGenerator(**generation_entries)
+
+ # Generate views
+ for idx_label, data in tqdm(metadata["multiviews"].items()):
+ positions = data["positions"]
+ orientations = data["orientations"]
+ n = len(positions)
+ for oidx in range(n):
+ observation = generator.render_viewpoint(
+ positions[oidx], quaternion.from_float_array(orientations[oidx])
+ )
+ observation_label = f"{oidx + 1}" # Leonid is indexing starting from 1
+ # Color image saved using PIL
+ img = PIL.Image.fromarray(observation["color"][:, :, :3])
+ filename = os.path.join(output_dir, f"{idx_label}_{observation_label}.jpeg")
+ img.save(filename)
+ if generate_depth:
+ # Depth image as EXR file
+ filename = os.path.join(
+ output_dir, f"{idx_label}_{observation_label}_depth.exr"
+ )
+ cv2.imwrite(
+ filename,
+ observation["depth"],
+ [cv2.IMWRITE_EXR_TYPE, cv2.IMWRITE_EXR_TYPE_HALF],
+ )
+ # Camera parameters
+ camera_params = dict(
+ [
+ (key, observation[key].tolist())
+ for key in ("camera_intrinsics", "R_cam2world", "t_cam2world")
+ ]
+ )
+ filename = os.path.join(
+ output_dir, f"{idx_label}_{observation_label}_camera_params.json"
+ )
+ with open(filename, "w") as f:
+ json.dump(camera_params, f)
+ # Save metadata
+ with open(os.path.join(output_dir, "metadata.json"), "w") as f:
+ json.dump(metadata, f)
+
+ generator.close()
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--metadata_filename", required=True)
+ parser.add_argument("--output_dir", required=True)
+ args = parser.parse_args()
+
+ generate_multiview_images_from_metadata(
+ metadata_filename=args.metadata_filename,
+ output_dir=args.output_dir,
+ scene_datasets_paths=SCENES_DATASET,
+ overload_params=dict(),
+ exist_ok=True,
+ )
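
Note on `generate_from_metadata.py` above: the path remapping sorts dataset labels by decreasing length before testing prefixes, because one label can be a prefix of another (`replica_cad` and `replica_cad_baked_lighting` both appear in the README's directory tree). The sketch below isolates that logic. The function name `remap_scene_path` and the directory roots are hypothetical, since the real mapping lives in `paths.py`, which is not shown here.

```python
import os


def remap_scene_path(value, scene_datasets_paths):
    """Replace the leading dataset label of `value` with its local root directory."""
    # Longest labels first, so a short label never shadows a longer one.
    ordered = sorted(
        scene_datasets_paths.items(), key=lambda kv: len(kv[0]), reverse=True
    )
    for label, root in ordered:
        if value.startswith(label):
            return os.path.normpath(
                os.path.join(root, os.path.relpath(value, label))
            )
    return value  # no known label: keep the path unchanged


# Hypothetical roots, for illustration only.
roots = {
    "replica_cad": "/data/replica_cad",
    "replica_cad_baked_lighting": "/data/rcbl",
}
print(remap_scene_path("replica_cad_baked_lighting/configs/a.json", roots))
```

Without the sort, the shorter label `replica_cad` could match first, and `os.path.relpath` would then produce a `../`-style path that resolves outside the intended root.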
diff --git a/third_party/dust3r/croco/datasets/habitat_sim/generate_from_metadata_files.py b/third_party/dust3r/croco/datasets/habitat_sim/generate_from_metadata_files.py
new file mode 100644
index 0000000000000000000000000000000000000000..f7a613ac65727edbbdd883c0e33fdde15730606a
--- /dev/null
+++ b/third_party/dust3r/croco/datasets/habitat_sim/generate_from_metadata_files.py
@@ -0,0 +1,37 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+"""
+Script generating commandlines to generate image pairs from metadata files.
+"""
+import argparse
+import glob
+import os
+
+from tqdm import tqdm
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--input_dir", required=True)
+ parser.add_argument("--output_dir", required=True)
+ parser.add_argument(
+ "--prefix",
+ default="",
+ help="Command-line prefix, useful e.g. to set up the environment.",
+ )
+ args = parser.parse_args()
+
+ input_metadata_filenames = glob.iglob(
+ f"{args.input_dir}/**/metadata.json", recursive=True
+ )
+
+ for metadata_filename in tqdm(input_metadata_filenames):
+ output_dir = os.path.join(
+ args.output_dir,
+ os.path.relpath(os.path.dirname(metadata_filename), args.input_dir),
+ )
+ # Do not process the scene if the metadata file already exists
+ if os.path.exists(os.path.join(output_dir, "metadata.json")):
+ continue
+ commandline = f"{args.prefix}python datasets/habitat_sim/generate_from_metadata.py --metadata_filename={metadata_filename} --output_dir={output_dir}"
+ print(commandline)
diff --git a/third_party/dust3r/croco/datasets/habitat_sim/generate_multiview_images.py b/third_party/dust3r/croco/datasets/habitat_sim/generate_multiview_images.py
new file mode 100644
index 0000000000000000000000000000000000000000..d1673925469d5c57626d202a5a189b55ccc57aa6
--- /dev/null
+++ b/third_party/dust3r/croco/datasets/habitat_sim/generate_multiview_images.py
@@ -0,0 +1,232 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+import argparse
+import json
+import os
+import shutil
+
+import cv2
+import numpy as np
+import PIL.Image
+import quaternion
+from datasets.habitat_sim.multiview_habitat_sim_generator import (
+ MultiviewHabitatSimGenerator,
+ NoNaviguableSpaceError,
+)
+from datasets.habitat_sim.paths import list_scenes_available
+from tqdm import tqdm
+
+
+def generate_multiview_images_for_scene(
+ scene_dataset_config_file,
+ scene,
+ navmesh,
+ output_dir,
+ views_count,
+ size,
+ exist_ok=False,
+ generate_depth=False,
+ **kwargs,
+):
+ """
+ Generate tuples of overlapping views for a given scene.
+ generate_depth: generate depth images and camera parameters.
+ """
+ if os.path.exists(output_dir) and not exist_ok:
+ print(f"Scene {scene}: data already generated. Ignoring generation.")
+ return
+ try:
+ print(f"Scene {scene}: {size} multiview acquisitions to generate...")
+ os.makedirs(output_dir, exist_ok=exist_ok)
+
+ metadata_filename = os.path.join(output_dir, "metadata.json")
+
+ metadata_template = dict(
+ scene_dataset_config_file=scene_dataset_config_file,
+ scene=scene,
+ navmesh=navmesh,
+ views_count=views_count,
+ size=size,
+ generate_depth=generate_depth,
+ **kwargs,
+ )
+ metadata_template["multiviews"] = dict()
+
+ if os.path.exists(metadata_filename):
+ print("Metadata file already exists:", metadata_filename)
+ print("Loading already generated metadata file...")
+ with open(metadata_filename, "r") as f:
+ metadata = json.load(f)
+
+ for key in metadata_template.keys():
+ if key != "multiviews":
+ assert (
+ metadata_template[key] == metadata[key]
+ ), f"existing file is inconsistent with the input parameters:\nKey: {key}\nmetadata: {metadata[key]}\ntemplate: {metadata_template[key]}."
+ else:
+ print("No temporary file found. Starting generation from scratch...")
+ metadata = metadata_template
+
+ starting_id = len(metadata["multiviews"])
+ print(f"Starting generation from index {starting_id}/{size}...")
+ if starting_id >= size:
+ print("Generation already done.")
+ return
+
+ generator = MultiviewHabitatSimGenerator(
+ scene_dataset_config_file=scene_dataset_config_file,
+ scene=scene,
+ navmesh=navmesh,
+ views_count=views_count,
+ size=size,
+ **kwargs,
+ )
+
+ for idx in tqdm(range(starting_id, size)):
+ # Generate / re-generate the observations
+ try:
+ data = generator[idx]
+ observations = data["observations"]
+ positions = data["positions"]
+ orientations = data["orientations"]
+
+ idx_label = f"{idx:08}"
+ for oidx, observation in enumerate(observations):
+ observation_label = (
+ f"{oidx + 1}" # Leonid is indexing starting from 1
+ )
+ # Color image saved using PIL
+ img = PIL.Image.fromarray(observation["color"][:, :, :3])
+ filename = os.path.join(
+ output_dir, f"{idx_label}_{observation_label}.jpeg"
+ )
+ img.save(filename)
+ if generate_depth:
+ # Depth image as EXR file
+ filename = os.path.join(
+ output_dir, f"{idx_label}_{observation_label}_depth.exr"
+ )
+ cv2.imwrite(
+ filename,
+ observation["depth"],
+ [cv2.IMWRITE_EXR_TYPE, cv2.IMWRITE_EXR_TYPE_HALF],
+ )
+ # Camera parameters
+ camera_params = dict(
+ [
+ (key, observation[key].tolist())
+ for key in (
+ "camera_intrinsics",
+ "R_cam2world",
+ "t_cam2world",
+ )
+ ]
+ )
+ filename = os.path.join(
+ output_dir,
+ f"{idx_label}_{observation_label}_camera_params.json",
+ )
+ with open(filename, "w") as f:
+ json.dump(camera_params, f)
+ metadata["multiviews"][idx_label] = {
+ "positions": positions.tolist(),
+ "orientations": orientations.tolist(),
+ "covisibility_ratios": data["covisibility_ratios"].tolist(),
+ "valid_fractions": data["valid_fractions"].tolist(),
+ "pairwise_visibility_ratios": data[
+ "pairwise_visibility_ratios"
+ ].tolist(),
+ }
+ except RecursionError:
+ print(
+ "Recursion error: unable to sample observations for this scene. We will stop there."
+ )
+ break
+
+ # Regularly save a temporary metadata file, in case we need to restart the generation
+ if idx % 10 == 0:
+ with open(metadata_filename, "w") as f:
+ json.dump(metadata, f)
+
+ # Save metadata
+ with open(metadata_filename, "w") as f:
+ json.dump(metadata, f)
+
+ generator.close()
+ except NoNaviguableSpaceError:
+ pass
+
+
+def create_commandline(scene_data, generate_depth, exist_ok=False):
+ """
+ Create a commandline string to generate a scene.
+ """
+
+ def my_formatting(val):
+ if val is None or val == "":
+ return '""'
+ else:
+ return val
+
+ commandline = f"""python {__file__} --scene {my_formatting(scene_data.scene)}
+ --scene_dataset_config_file {my_formatting(scene_data.scene_dataset_config_file)}
+ --navmesh {my_formatting(scene_data.navmesh)}
+ --output_dir {my_formatting(scene_data.output_dir)}
+ --generate_depth {int(generate_depth)}
+ --exist_ok {int(exist_ok)}
+ """
+ commandline = " ".join(commandline.split())
+ return commandline
+
+
+if __name__ == "__main__":
+ os.umask(2)
+
+ parser = argparse.ArgumentParser(
+ description="""Example of use -- listing commands to generate data for scenes available:
+ > python datasets/habitat_sim/generate_multiview_images.py --list_commands --output_dir=$PAIRS_DATASET_DIR
+ """
+ )
+
+ parser.add_argument("--output_dir", type=str, required=True)
+ parser.add_argument(
+ "--list_commands", action="store_true", help="list commandlines to run if true"
+ )
+ parser.add_argument("--scene", type=str, default="")
+ parser.add_argument("--scene_dataset_config_file", type=str, default="")
+ parser.add_argument("--navmesh", type=str, default="")
+
+ parser.add_argument("--generate_depth", type=int, default=1)
+ parser.add_argument("--exist_ok", type=int, default=0)
+
+ kwargs = dict(resolution=(256, 256), hfov=60, views_count=2, size=1000)
+
+ args = parser.parse_args()
+ generate_depth = bool(args.generate_depth)
+ exist_ok = bool(args.exist_ok)
+
+ if args.list_commands:
+ # Listing scenes available...
+ scenes_data = list_scenes_available(base_output_dir=args.output_dir)
+
+ for scene_data in scenes_data:
+ print(
+ create_commandline(
+ scene_data, generate_depth=generate_depth, exist_ok=exist_ok
+ )
+ )
+ else:
+ if args.scene == "" or args.output_dir == "":
+ print("Missing scene or output dir argument!")
+ print(parser.format_help())
+ else:
+ generate_multiview_images_for_scene(
+ scene=args.scene,
+ scene_dataset_config_file=args.scene_dataset_config_file,
+ navmesh=args.navmesh,
+ output_dir=args.output_dir,
+ exist_ok=exist_ok,
+ generate_depth=generate_depth,
+ **kwargs,
+ )
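
Note on the generation scripts above: when depth generation is enabled, each view gets a `*_camera_params.json` file with the keys `camera_intrinsics`, `R_cam2world` and `t_cam2world`, expressed in the OpenCV camera convention. The sketch below shows one way a consumer might assemble a 4x4 camera-to-world matrix from such a file using plain lists. The function name `cam2world_matrix` is hypothetical and the example values are made up.

```python
import json


def cam2world_matrix(camera_params):
    """Assemble a 4x4 camera-to-world matrix (nested lists) from a camera_params dict."""
    R = camera_params["R_cam2world"]  # 3x3 rotation
    t = camera_params["t_cam2world"]  # 3-vector translation
    return [list(R[i]) + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


# Stand-in for the contents of one *_camera_params.json file (made-up values).
example = json.loads(
    '{"camera_intrinsics": [[221.7, 0, 128], [0, 221.7, 128], [0, 0, 1]],'
    ' "R_cam2world": [[1, 0, 0], [0, 1, 0], [0, 0, 1]],'
    ' "t_cam2world": [0.5, 1.5, -2.0]}'
)
T = cam2world_matrix(example)
print(T[0])  # [1, 0, 0, 0.5]
```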
diff --git a/third_party/dust3r/croco/datasets/habitat_sim/multiview_habitat_sim_generator.py b/third_party/dust3r/croco/datasets/habitat_sim/multiview_habitat_sim_generator.py
new file mode 100644
index 0000000000000000000000000000000000000000..fd24a9df9ccaefbbd5e7d5996989de1c9d09dc34
--- /dev/null
+++ b/third_party/dust3r/croco/datasets/habitat_sim/multiview_habitat_sim_generator.py
@@ -0,0 +1,504 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+import json
+import os
+
+import cv2
+import habitat_sim
+import numpy as np
+import quaternion
+from sklearn.neighbors import NearestNeighbors
+
+# OpenCV to habitat camera convention transformation
+R_OPENCV2HABITAT = np.stack(
+ (habitat_sim.geo.RIGHT, -habitat_sim.geo.UP, habitat_sim.geo.FRONT), axis=0
+)
+R_HABITAT2OPENCV = R_OPENCV2HABITAT.T
+DEG2RAD = np.pi / 180
+
+
+def compute_camera_intrinsics(height, width, hfov):
+ f = width / 2 / np.tan(hfov / 2 * np.pi / 180)
+ cu, cv = width / 2, height / 2
+ return f, cu, cv
+
+
+def compute_camera_pose_opencv_convention(camera_position, camera_orientation):
+ R_cam2world = quaternion.as_rotation_matrix(camera_orientation) @ R_OPENCV2HABITAT
+ t_cam2world = np.asarray(camera_position)
+ return R_cam2world, t_cam2world
+
+
+def compute_pointmap(depthmap, hfov):
+ """Compute a HxWx3 pointmap in camera frame from a HxW depth map."""
+ height, width = depthmap.shape
+ f, cu, cv = compute_camera_intrinsics(height, width, hfov)
+ # Cast depth map to point
+ z_cam = depthmap
+ u, v = np.meshgrid(range(width), range(height))
+ x_cam = (u - cu) / f * z_cam
+ y_cam = (v - cv) / f * z_cam
+ X_cam = np.stack((x_cam, y_cam, z_cam), axis=-1)
+ return X_cam
+
+
+def compute_pointcloud(depthmap, hfov, camera_position, camera_rotation):
+ """Return a 3D point cloud corresponding to valid pixels of the depth map"""
+ R_cam2world, t_cam2world = compute_camera_pose_opencv_convention(
+ camera_position, camera_rotation
+ )
+
+ X_cam = compute_pointmap(depthmap=depthmap, hfov=hfov)
+ valid_mask = X_cam[:, :, 2] != 0.0
+
+ X_cam = X_cam.reshape(-1, 3)[valid_mask.flatten()]
+ X_world = X_cam @ R_cam2world.T + t_cam2world.reshape(1, 3)
+ return X_world
+
+
+def compute_pointcloud_overlaps_scikit(
+ pointcloud1, pointcloud2, distance_threshold, compute_symmetric=False
+):
+ """
+ Compute 'overlapping' metrics based on a distance threshold between two point clouds.
+ """
+ nbrs = NearestNeighbors(n_neighbors=1, algorithm="kd_tree").fit(pointcloud2)
+ distances, indices = nbrs.kneighbors(pointcloud1)
+ intersection1 = np.count_nonzero(distances.flatten() < distance_threshold)
+
+ data = {"intersection1": intersection1, "size1": len(pointcloud1)}
+ if compute_symmetric:
+ nbrs = NearestNeighbors(n_neighbors=1, algorithm="kd_tree").fit(pointcloud1)
+ distances, indices = nbrs.kneighbors(pointcloud2)
+ intersection2 = np.count_nonzero(distances.flatten() < distance_threshold)
+ data["intersection2"] = intersection2
+ data["size2"] = len(pointcloud2)
+
+ return data
+
+
+def _append_camera_parameters(observation, hfov, camera_location, camera_rotation):
+ """
+ Add camera parameters to the observation dictionary produced by Habitat-Sim.
+ The observation is modified in place.
+ """
+ R_cam2world, t_cam2world = compute_camera_pose_opencv_convention(
+ camera_location, camera_rotation
+ )
+ height, width = observation["depth"].shape
+ f, cu, cv = compute_camera_intrinsics(height, width, hfov)
+ K = np.asarray([[f, 0, cu], [0, f, cv], [0, 0, 1.0]])
+ observation["camera_intrinsics"] = K
+ observation["t_cam2world"] = t_cam2world
+ observation["R_cam2world"] = R_cam2world
+
+
+def look_at(eye, center, up, return_cam2world=True):
+ """
+ Return camera pose looking at a given center point.
+ Analogous to the gluLookAt function, using the OpenCV camera convention.
+ """
+ z = center - eye
+ z /= np.linalg.norm(z, axis=-1, keepdims=True)
+ y = -up
+ y = y - np.sum(y * z, axis=-1, keepdims=True) * z
+ y /= np.linalg.norm(y, axis=-1, keepdims=True)
+ x = np.cross(y, z, axis=-1)
+
+ if return_cam2world:
+ R = np.stack((x, y, z), axis=-1)
+ t = eye
+ else:
+ # World to camera transformation
+ # Transposed matrix
+ R = np.stack((x, y, z), axis=-2)
+ t = -np.einsum("...ij, ...j", R, eye)
+ return R, t
+
+
+def look_at_for_habitat(eye, center, up, return_cam2world=True):
+ R, t = look_at(eye, center, up)
+ orientation = quaternion.from_rotation_matrix(R @ R_OPENCV2HABITAT.T)
+ return orientation, t
+
+
+def generate_orientation_noise(pan_range, tilt_range, roll_range):
+ return (
+ quaternion.from_rotation_vector(
+ np.random.uniform(*pan_range) * DEG2RAD * habitat_sim.geo.UP
+ )
+ * quaternion.from_rotation_vector(
+ np.random.uniform(*tilt_range) * DEG2RAD * habitat_sim.geo.RIGHT
+ )
+ * quaternion.from_rotation_vector(
+ np.random.uniform(*roll_range) * DEG2RAD * habitat_sim.geo.FRONT
+ )
+ )
+
+
+class NoNaviguableSpaceError(RuntimeError):
+ def __init__(self, *args):
+ super().__init__(*args)
+
+
+class MultiviewHabitatSimGenerator:
+ def __init__(
+ self,
+ scene,
+ navmesh,
+ scene_dataset_config_file,
+ resolution=(240, 320),
+ views_count=2,
+ hfov=60,
+ gpu_id=0,
+ size=10000,
+ minimum_covisibility=0.5,
+ transform=None,
+ ):
+ self.scene = scene
+ self.navmesh = navmesh
+ self.scene_dataset_config_file = scene_dataset_config_file
+ self.resolution = resolution
+ self.views_count = views_count
+ assert self.views_count >= 1
+ self.hfov = hfov
+ self.gpu_id = gpu_id
+ self.size = size
+ self.transform = transform
+
+ # Noise added to camera orientation
+ self.pan_range = (-3, 3)
+ self.tilt_range = (-10, 10)
+ self.roll_range = (-5, 5)
+
+ # Height range to sample cameras
+ self.height_range = (1.2, 1.8)
+
+ # Random steps between the camera views
+ self.random_steps_count = 5
+ self.random_step_variance = 2.0
+
+ # Minimum fraction of the scene which should be valid (well defined depth)
+ self.minimum_valid_fraction = 0.7
+
+ # Distance threshold (between nearest 3D points) used to select overlapping pairs
+ self.distance_threshold = 0.05
+ # Minimum IoU of a view point cloud with respect to the reference view to be kept.
+ self.minimum_covisibility = minimum_covisibility
+
+ # Maximum number of retries.
+ self.max_attempts_count = 100
+
+ self.seed = None
+ self._lazy_initialization()
+
+ def _lazy_initialization(self):
+ # Lazy random seeding and instantiation of the simulator to deal with multiprocessing properly
+ if self.seed is None:
+ # Re-seed numpy generator
+ np.random.seed()
+ self.seed = np.random.randint(2**32 - 1)
+ sim_cfg = habitat_sim.SimulatorConfiguration()
+ sim_cfg.scene_id = self.scene
+ if (
+ self.scene_dataset_config_file is not None
+ and self.scene_dataset_config_file != ""
+ ):
+ sim_cfg.scene_dataset_config_file = self.scene_dataset_config_file
+ sim_cfg.random_seed = self.seed
+ sim_cfg.load_semantic_mesh = False
+ sim_cfg.gpu_device_id = self.gpu_id
+
+ depth_sensor_spec = habitat_sim.CameraSensorSpec()
+ depth_sensor_spec.uuid = "depth"
+ depth_sensor_spec.sensor_type = habitat_sim.SensorType.DEPTH
+ depth_sensor_spec.resolution = self.resolution
+ depth_sensor_spec.hfov = self.hfov
+ depth_sensor_spec.position = [0.0, 0.0, 0]
+ # Note: depth_sensor_spec.orientation is left at its default value
+
+ rgb_sensor_spec = habitat_sim.CameraSensorSpec()
+ rgb_sensor_spec.uuid = "color"
+ rgb_sensor_spec.sensor_type = habitat_sim.SensorType.COLOR
+ rgb_sensor_spec.resolution = self.resolution
+ rgb_sensor_spec.hfov = self.hfov
+ rgb_sensor_spec.position = [0.0, 0.0, 0]
+ agent_cfg = habitat_sim.agent.AgentConfiguration(
+ sensor_specifications=[rgb_sensor_spec, depth_sensor_spec]
+ )
+
+ cfg = habitat_sim.Configuration(sim_cfg, [agent_cfg])
+ self.sim = habitat_sim.Simulator(cfg)
+ if self.navmesh is not None and self.navmesh != "":
+ # Use pre-computed navmesh when available (usually better than those generated automatically)
+ self.sim.pathfinder.load_nav_mesh(self.navmesh)
+
+ if not self.sim.pathfinder.is_loaded:
+ # Try to compute a navmesh
+ navmesh_settings = habitat_sim.NavMeshSettings()
+ navmesh_settings.set_defaults()
+ self.sim.recompute_navmesh(self.sim.pathfinder, navmesh_settings, True)
+
+ # Ensure that the navmesh is not empty
+ if not self.sim.pathfinder.is_loaded:
+ raise NoNaviguableSpaceError(
+ f"No navigable location (scene: {self.scene} -- navmesh: {self.navmesh})"
+ )
+
+ self.agent = self.sim.initialize_agent(agent_id=0)
+
+ def close(self):
+ self.sim.close()
+
+ def __del__(self):
+ self.sim.close()
+
+ def __len__(self):
+ return self.size
+
+ def sample_random_viewpoint(self):
+ """Sample a random viewpoint using the navmesh"""
+ nav_point = self.sim.pathfinder.get_random_navigable_point()
+
+ # Sample a random viewpoint height
+ viewpoint_height = np.random.uniform(*self.height_range)
+ viewpoint_position = nav_point + viewpoint_height * habitat_sim.geo.UP
+ viewpoint_orientation = quaternion.from_rotation_vector(
+ np.random.uniform(0, 2 * np.pi) * habitat_sim.geo.UP
+ ) * generate_orientation_noise(self.pan_range, self.tilt_range, self.roll_range)
+ return viewpoint_position, viewpoint_orientation, nav_point
+
+ def sample_other_random_viewpoint(self, observed_point, nav_point):
+ """Sample a random viewpoint close to an existing one, using the navmesh and a reference observed point."""
+ other_nav_point = nav_point
+
+ walk_directions = self.random_step_variance * np.asarray([1, 0, 1])
+ for i in range(self.random_steps_count):
+ temp = self.sim.pathfinder.snap_point(
+ other_nav_point + walk_directions * np.random.normal(size=3)
+ )
+ # Snapping may return nan when it fails
+ if not np.isnan(temp[0]):
+ other_nav_point = temp
+
+ other_viewpoint_height = np.random.uniform(*self.height_range)
+ other_viewpoint_position = (
+ other_nav_point + other_viewpoint_height * habitat_sim.geo.UP
+ )
+
+ # Set viewing direction towards the central point
+ rotation, position = look_at_for_habitat(
+ eye=other_viewpoint_position,
+ center=observed_point,
+ up=habitat_sim.geo.UP,
+ return_cam2world=True,
+ )
+ rotation = rotation * generate_orientation_noise(
+ self.pan_range, self.tilt_range, self.roll_range
+ )
+ return position, rotation, other_nav_point
+
+ def is_other_pointcloud_overlapping(self, ref_pointcloud, other_pointcloud):
+ """Check if a viewpoint is valid and overlaps significantly with a reference one."""
+ # Observation
+ pixels_count = self.resolution[0] * self.resolution[1]
+ valid_fraction = len(other_pointcloud) / pixels_count
+ assert valid_fraction <= 1.0 and valid_fraction >= 0.0
+ overlap = compute_pointcloud_overlaps_scikit(
+ ref_pointcloud,
+ other_pointcloud,
+ self.distance_threshold,
+ compute_symmetric=True,
+ )
+ covisibility = min(
+ overlap["intersection1"] / pixels_count,
+ overlap["intersection2"] / pixels_count,
+ )
+ is_valid = (valid_fraction >= self.minimum_valid_fraction) and (
+ covisibility >= self.minimum_covisibility
+ )
+ return is_valid, valid_fraction, covisibility
+
+ def is_other_viewpoint_overlapping(
+ self, ref_pointcloud, observation, position, rotation
+ ):
+ """Check if a viewpoint is valid and overlaps significantly with a reference one."""
+ # Observation
+ other_pointcloud = compute_pointcloud(
+ observation["depth"], self.hfov, position, rotation
+ )
+ return self.is_other_pointcloud_overlapping(ref_pointcloud, other_pointcloud)
+
+ def render_viewpoint(self, viewpoint_position, viewpoint_orientation):
+ agent_state = habitat_sim.AgentState()
+ agent_state.position = viewpoint_position
+ agent_state.rotation = viewpoint_orientation
+ self.agent.set_state(agent_state)
+ viewpoint_observations = self.sim.get_sensor_observations(agent_ids=0)
+ _append_camera_parameters(
+ viewpoint_observations, self.hfov, viewpoint_position, viewpoint_orientation
+ )
+ return viewpoint_observations
+
+ def __getitem__(self, useless_idx):
+ ref_position, ref_orientation, nav_point = self.sample_random_viewpoint()
+ ref_observations = self.render_viewpoint(ref_position, ref_orientation)
+ # Extract point cloud
+ ref_pointcloud = compute_pointcloud(
+ depthmap=ref_observations["depth"],
+ hfov=self.hfov,
+ camera_position=ref_position,
+ camera_rotation=ref_orientation,
+ )
+
+ pixels_count = self.resolution[0] * self.resolution[1]
+ ref_valid_fraction = len(ref_pointcloud) / pixels_count
+ assert ref_valid_fraction <= 1.0 and ref_valid_fraction >= 0.0
+ if ref_valid_fraction < self.minimum_valid_fraction:
+ # This should produce a recursion error at some point when something is very wrong.
+ return self[0]
+ # Pick a reference observed point in the point cloud
+ observed_point = np.mean(ref_pointcloud, axis=0)
+
+ # Add the first image as reference
+ viewpoints_observations = [ref_observations]
+ viewpoints_covisibility = [ref_valid_fraction]
+ viewpoints_positions = [ref_position]
+ viewpoints_orientations = [quaternion.as_float_array(ref_orientation)]
+ viewpoints_clouds = [ref_pointcloud]
+ viewpoints_valid_fractions = [ref_valid_fraction]
+
+ for _ in range(self.views_count - 1):
+ # Generate another viewpoint using a simple random walk
+ successful_sampling = False
+ for sampling_attempt in range(self.max_attempts_count):
+ position, rotation, _ = self.sample_other_random_viewpoint(
+ observed_point, nav_point
+ )
+ # Observation
+ other_viewpoint_observations = self.render_viewpoint(position, rotation)
+ other_pointcloud = compute_pointcloud(
+ other_viewpoint_observations["depth"], self.hfov, position, rotation
+ )
+
+ (
+ is_valid,
+ valid_fraction,
+ covisibility,
+ ) = self.is_other_pointcloud_overlapping(
+ ref_pointcloud, other_pointcloud
+ )
+ if is_valid:
+ successful_sampling = True
+ break
+ if not successful_sampling:
+ print("WARNING: Maximum number of attempts reached.")
+ # Dirty hack, try using a novel original viewpoint
+ return self[0]
+ viewpoints_observations.append(other_viewpoint_observations)
+ viewpoints_covisibility.append(covisibility)
+ viewpoints_positions.append(position)
+ viewpoints_orientations.append(
+ quaternion.as_float_array(rotation)
+ ) # WXYZ convention for the quaternion encoding.
+ viewpoints_clouds.append(other_pointcloud)
+ viewpoints_valid_fractions.append(valid_fraction)
+
+ # Estimate relations between all pairs of images
+ pairwise_visibility_ratios = np.ones(
+ (len(viewpoints_observations), len(viewpoints_observations))
+ )
+ for i in range(len(viewpoints_observations)):
+ pairwise_visibility_ratios[i, i] = viewpoints_valid_fractions[i]
+ for j in range(i + 1, len(viewpoints_observations)):
+ overlap = compute_pointcloud_overlaps_scikit(
+ viewpoints_clouds[i],
+ viewpoints_clouds[j],
+ self.distance_threshold,
+ compute_symmetric=True,
+ )
+ pairwise_visibility_ratios[i, j] = (
+ overlap["intersection1"] / pixels_count
+ )
+ pairwise_visibility_ratios[j, i] = (
+ overlap["intersection2"] / pixels_count
+ )
+
+ # IoU is relative to the image 0
+ data = {
+ "observations": viewpoints_observations,
+ "positions": np.asarray(viewpoints_positions),
+ "orientations": np.asarray(viewpoints_orientations),
+ "covisibility_ratios": np.asarray(viewpoints_covisibility),
+ "valid_fractions": np.asarray(viewpoints_valid_fractions, dtype=float),
+ "pairwise_visibility_ratios": np.asarray(
+ pairwise_visibility_ratios, dtype=float
+ ),
+ }
+
+ if self.transform is not None:
+ data = self.transform(data)
+ return data
+
+ def generate_random_spiral_trajectory(
+ self,
+ images_count=100,
+ max_radius=0.5,
+ half_turns=5,
+ use_constant_orientation=False,
+ ):
+ """
+ Return a list of images corresponding to a spiral trajectory from a random starting point.
+ Useful to generate nice visualisations.
+ Use an even number of half turns to get a nice "C1-continuous" loop effect
+ """
+ ref_position, ref_orientation, navpoint = self.sample_random_viewpoint()
+ ref_observations = self.render_viewpoint(ref_position, ref_orientation)
+ ref_pointcloud = compute_pointcloud(
+ depthmap=ref_observations["depth"],
+ hfov=self.hfov,
+ camera_position=ref_position,
+ camera_rotation=ref_orientation,
+ )
+ pixels_count = self.resolution[0] * self.resolution[1]
+ if len(ref_pointcloud) / pixels_count < self.minimum_valid_fraction:
+ # Dirty hack: ensure that the valid part of the image is significant
+ return self.generate_random_spiral_trajectory(
+ images_count, max_radius, half_turns, use_constant_orientation
+ )
+
+ # Pick an observed point in the point cloud
+ observed_point = np.mean(ref_pointcloud, axis=0)
+ ref_R, ref_t = compute_camera_pose_opencv_convention(
+ ref_position, ref_orientation
+ )
+
+ images = []
+ is_valid = []
+ # Spiral trajectory, use_constant orientation
+ for i, alpha in enumerate(np.linspace(0, 1, images_count)):
+ r = max_radius * np.abs(
+ np.sin(alpha * np.pi)
+ ) # Increase then decrease the radius
+ theta = alpha * half_turns * np.pi
+ x = r * np.cos(theta)
+ y = r * np.sin(theta)
+ z = 0.0
+ position = (
+ ref_position + (ref_R @ np.asarray([x, y, z]).reshape(3, 1)).flatten()
+ )
+ if use_constant_orientation:
+ orientation = ref_orientation
+ else:
+ # trajectory looking at a mean point in front of the ref observation
+ orientation, position = look_at_for_habitat(
+ eye=position, center=observed_point, up=habitat_sim.geo.UP
+ )
+ observations = self.render_viewpoint(position, orientation)
+ images.append(observations["color"][..., :3])
+ _is_valid, valid_fraction, iou = self.is_other_viewpoint_overlapping(
+ ref_pointcloud, observations, position, orientation
+ )
+ is_valid.append(_is_valid)
+ return images, np.all(is_valid)
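Review note on `generate_random_spiral_trajectory`: the camera offsets follow `r = max_radius * |sin(alpha * pi)|` and `theta = alpha * half_turns * pi`, so the path starts and ends at the reference position. A minimal standalone sketch of just that offset computation, assuming nothing beyond the formulas in the loop above (`spiral_offsets` is a hypothetical helper name, not part of the patch):

```python
import math


def spiral_offsets(images_count=100, max_radius=0.5, half_turns=5):
    """Return (x, y) offsets in the reference camera plane, one per frame.

    Hypothetical helper mirroring the r/theta formulas of the patch.
    """
    offsets = []
    for i in range(images_count):
        # alpha sweeps [0, 1] inclusive, like np.linspace(0, 1, images_count)
        alpha = i / (images_count - 1) if images_count > 1 else 0.0
        # Radius grows from 0 to max_radius, then shrinks back to 0
        r = max_radius * abs(math.sin(alpha * math.pi))
        theta = alpha * half_turns * math.pi
        offsets.append((r * math.cos(theta), r * math.sin(theta)))
    return offsets
```

The radius peaks at `alpha = 0.5`, so no frame is ever farther than `max_radius` from the reference viewpoint.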
diff --git a/third_party/dust3r/croco/datasets/habitat_sim/pack_metadata_files.py b/third_party/dust3r/croco/datasets/habitat_sim/pack_metadata_files.py
new file mode 100644
index 0000000000000000000000000000000000000000..92f1b367747417fd4625a5a4f975f10527247076
--- /dev/null
+++ b/third_party/dust3r/croco/datasets/habitat_sim/pack_metadata_files.py
@@ -0,0 +1,81 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+"""
+Utility script to pack metadata files of the dataset in order to be able to re-generate it elsewhere.
+"""
+import argparse
+import collections
+import glob
+import json
+import os
+import shutil
+
+from datasets.habitat_sim.paths import SCENES_DATASET
+from tqdm import tqdm
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("input_dir")
+ parser.add_argument("output_dir")
+ args = parser.parse_args()
+
+ input_dirname = args.input_dir
+ output_dirname = args.output_dir
+
+ input_metadata_filenames = glob.iglob(
+ f"{input_dirname}/**/metadata.json", recursive=True
+ )
+
+ images_count = collections.defaultdict(lambda: 0)
+
+ os.makedirs(output_dirname)
+ for input_filename in tqdm(input_metadata_filenames):
+ # Ignore empty files
+ with open(input_filename, "r") as f:
+ original_metadata = json.load(f)
+ if (
+ "multiviews" not in original_metadata
+ or len(original_metadata["multiviews"]) == 0
+ ):
+ print("No views in", input_filename)
+ continue
+
+ relpath = os.path.relpath(input_filename, input_dirname)
+ print(relpath)
+
+ # Copy metadata, while replacing scene paths by generic keys depending on the dataset, for portability.
+ # Data paths are sorted by decreasing length so that, when several paths share a common prefix, the longest one is matched first.
+ scenes_dataset_paths = dict(
+ sorted(SCENES_DATASET.items(), key=lambda x: len(x[1]), reverse=True)
+ )
+ metadata = dict()
+ for key, value in original_metadata.items():
+ if key in ("scene_dataset_config_file", "scene", "navmesh") and value != "":
+ known_path = False
+ for dataset, dataset_path in scenes_dataset_paths.items():
+ if value.startswith(dataset_path):
+ value = os.path.join(
+ dataset, os.path.relpath(value, dataset_path)
+ )
+ known_path = True
+ break
+ if not known_path:
+ raise KeyError("Unknown path:" + value)
+ metadata[key] = value
+
+ # Compile some general statistics while packing data
+ scene_split = metadata["scene"].split("/")
+ upper_level = (
+ "/".join(scene_split[:2]) if scene_split[0] == "hm3d" else scene_split[0]
+ )
+ images_count[upper_level] += len(metadata["multiviews"])
+
+ output_filename = os.path.join(output_dirname, relpath)
+ os.makedirs(os.path.dirname(output_filename), exist_ok=True)
+ with open(output_filename, "w") as f:
+ json.dump(metadata, f)
+
+ # Print statistics
+ print("Images count:")
+ for upper_level, count in images_count.items():
+ print(f"- {upper_level}: {count}")
diff --git a/third_party/dust3r/croco/datasets/habitat_sim/paths.py b/third_party/dust3r/croco/datasets/habitat_sim/paths.py
new file mode 100644
index 0000000000000000000000000000000000000000..aad8e2257c41bac69245e66cd956736f3a848edc
--- /dev/null
+++ b/third_party/dust3r/croco/datasets/habitat_sim/paths.py
@@ -0,0 +1,179 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+"""
+Paths to Habitat-Sim scenes
+"""
+
+import collections
+import json
+import os
+
+from tqdm import tqdm
+
+# Hardcoded path to the different scene datasets
+SCENES_DATASET = {
+ "hm3d": "./data/habitat-sim-data/scene_datasets/hm3d/",
+ "gibson": "./data/habitat-sim-data/scene_datasets/gibson/",
+ "habitat-test-scenes": "./data/habitat-sim/scene_datasets/habitat-test-scenes/",
+ "replica_cad_baked_lighting": "./data/habitat-sim/scene_datasets/replica_cad_baked_lighting/",
+ "replica_cad": "./data/habitat-sim/scene_datasets/replica_cad/",
+ "replica": "./data/habitat-sim/scene_datasets/ReplicaDataset/",
+ "scannet": "./data/habitat-sim/scene_datasets/scannet/",
+}
+
+SceneData = collections.namedtuple(
+ "SceneData", ["scene_dataset_config_file", "scene", "navmesh", "output_dir"]
+)
+
+
+def list_replicacad_scenes(base_output_dir, base_path=SCENES_DATASET["replica_cad"]):
+ scene_dataset_config_file = os.path.join(
+ base_path, "replicaCAD.scene_dataset_config.json"
+ )
+ scenes = [f"apt_{i}" for i in range(6)] + ["empty_stage"]
+ navmeshes = [f"navmeshes/apt_{i}_static_furniture.navmesh" for i in range(6)] + [
+ "empty_stage.navmesh"
+ ]
+ scenes_data = []
+ for idx in range(len(scenes)):
+ output_dir = os.path.join(base_output_dir, "ReplicaCAD", scenes[idx])
+ # Add scene
+ data = SceneData(
+ scene_dataset_config_file=scene_dataset_config_file,
+ scene=scenes[idx] + ".scene_instance.json",
+ navmesh=os.path.join(base_path, navmeshes[idx]),
+ output_dir=output_dir,
+ )
+ scenes_data.append(data)
+ return scenes_data
+
+
+def list_replica_cad_baked_lighting_scenes(
+ base_output_dir, base_path=SCENES_DATASET["replica_cad_baked_lighting"]
+):
+ scene_dataset_config_file = os.path.join(
+ base_path, "replicaCAD_baked.scene_dataset_config.json"
+ )
+ scenes = sum(
+ [[f"Baked_sc{i}_staging_{j:02}" for i in range(5)] for j in range(21)], []
+ )
+ navmeshes = "" # [f"navmeshes/apt_{i}_static_furniture.navmesh" for i in range(6)] + ["empty_stage.navmesh"]
+ scenes_data = []
+ for idx in range(len(scenes)):
+ output_dir = os.path.join(
+ base_output_dir, "replica_cad_baked_lighting", scenes[idx]
+ )
+ data = SceneData(
+ scene_dataset_config_file=scene_dataset_config_file,
+ scene=scenes[idx],
+ navmesh="",
+ output_dir=output_dir,
+ )
+ scenes_data.append(data)
+ return scenes_data
+
+
+def list_replica_scenes(base_output_dir, base_path):
+ scenes_data = []
+ for scene_id in os.listdir(base_path):
+ scene = os.path.join(base_path, scene_id, "mesh.ply")
+ navmesh = os.path.join(
+ base_path, scene_id, "habitat/mesh_preseg_semantic.navmesh"
+ ) # Not sure if I should use it
+ scene_dataset_config_file = ""
+ output_dir = os.path.join(base_output_dir, scene_id)
+ # Add scene
+ data = SceneData(
+ scene_dataset_config_file=scene_dataset_config_file,
+ scene=scene,
+ navmesh=navmesh,
+ output_dir=output_dir,
+ )
+ scenes_data.append(data)
+ return scenes_data
+
+
+def list_scenes(base_output_dir, base_path):
+ """
+ Generic method iterating through a base_path folder to find scenes.
+ """
+ scenes_data = []
+ for root, dirs, files in os.walk(base_path, followlinks=True):
+ folder_scenes_data = []
+ for file in files:
+ name, ext = os.path.splitext(file)
+ if ext == ".glb":
+ scene = os.path.join(root, name + ".glb")
+ navmesh = os.path.join(root, name + ".navmesh")
+ if not os.path.exists(navmesh):
+ navmesh = ""
+ relpath = os.path.relpath(root, base_path)
+ output_dir = os.path.abspath(
+ os.path.join(base_output_dir, relpath, name)
+ )
+ data = SceneData(
+ scene_dataset_config_file="",
+ scene=scene,
+ navmesh=navmesh,
+ output_dir=output_dir,
+ )
+ folder_scenes_data.append(data)
+
+ # Specific check for HM3D:
+ # When two meshes xxxx.basis.glb and xxxx.glb are present, use the 'basis' version.
+ basis_scenes = [
+ data.scene[: -len(".basis.glb")]
+ for data in folder_scenes_data
+ if data.scene.endswith(".basis.glb")
+ ]
+ if len(basis_scenes) != 0:
+ folder_scenes_data = [
+ data
+ for data in folder_scenes_data
+ if not (data.scene[: -len(".glb")] in basis_scenes)
+ ]
+
+ scenes_data.extend(folder_scenes_data)
+ return scenes_data
+
+
+def list_scenes_available(base_output_dir, scenes_dataset_paths=SCENES_DATASET):
+ scenes_data = []
+
+ # HM3D
+ for split in ("minival", "train", "val", "examples"):
+ scenes_data += list_scenes(
+ base_output_dir=os.path.join(base_output_dir, f"hm3d/{split}/"),
+ base_path=f"{scenes_dataset_paths['hm3d']}/{split}",
+ )
+
+ # Gibson
+ scenes_data += list_scenes(
+ base_output_dir=os.path.join(base_output_dir, "gibson"),
+ base_path=scenes_dataset_paths["gibson"],
+ )
+
+ # Habitat test scenes (just a few)
+ scenes_data += list_scenes(
+ base_output_dir=os.path.join(base_output_dir, "habitat-test-scenes"),
+ base_path=scenes_dataset_paths["habitat-test-scenes"],
+ )
+
+ # ReplicaCAD (baked lighting)
+ scenes_data += list_replica_cad_baked_lighting_scenes(
+ base_output_dir=base_output_dir
+ )
+
+ # ScanNet
+ scenes_data += list_scenes(
+ base_output_dir=os.path.join(base_output_dir, "scannet"),
+ base_path=scenes_dataset_paths["scannet"],
+ )
+
+ # Replica
+ scenes_data += list_replica_scenes(
+ base_output_dir=os.path.join(base_output_dir, "replica"),
+ base_path=scenes_dataset_paths["replica"],
+ )
+ return scenes_data
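Review note on the HM3D check in `list_scenes`: when a folder holds both `xxxx.basis.glb` and `xxxx.glb`, only the `basis` file is kept. The filtering can be sketched on plain path strings as follows; `prefer_basis_meshes` is a hypothetical name used only for illustration and works on strings rather than `SceneData` tuples:

```python
def prefer_basis_meshes(scene_paths):
    """Drop 'xxxx.glb' whenever 'xxxx.basis.glb' is also listed.

    Hypothetical helper mirroring the filtering logic of the patch.
    """
    # Stems of all meshes that have a compressed 'basis' variant
    basis_stems = {
        path[: -len(".basis.glb")]
        for path in scene_paths
        if path.endswith(".basis.glb")
    }
    # Keep a path unless its stem (without '.glb') has a basis variant
    return [path for path in scene_paths if path[: -len(".glb")] not in basis_stems]
```

Note that the `basis` file itself survives the filter because stripping `.glb` from it leaves `xxxx.basis`, which is never in the stem set.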
diff --git a/third_party/dust3r/croco/datasets/pairs_dataset.py b/third_party/dust3r/croco/datasets/pairs_dataset.py
new file mode 100644
index 0000000000000000000000000000000000000000..892a37a71b2296e486d83930e5dac2b690f83c7b
--- /dev/null
+++ b/third_party/dust3r/croco/datasets/pairs_dataset.py
@@ -0,0 +1,161 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+import os
+
+from datasets.transforms import get_pair_transforms
+from PIL import Image
+from torch.utils.data import Dataset
+
+
+def load_image(impath):
+ return Image.open(impath)
+
+
+def load_pairs_from_cache_file(fname, root=""):
+ assert os.path.isfile(
+ fname
+ ), "cannot parse pairs from {:s}, file does not exist".format(fname)
+ with open(fname, "r") as fid:
+ lines = fid.read().strip().splitlines()
+ pairs = [
+ (os.path.join(root, l.split()[0]), os.path.join(root, l.split()[1]))
+ for l in lines
+ ]
+ return pairs
+
+
+def load_pairs_from_list_file(fname, root=""):
+ assert os.path.isfile(
+ fname
+ ), "cannot parse pairs from {:s}, file does not exist".format(fname)
+ with open(fname, "r") as fid:
+ lines = fid.read().strip().splitlines()
+ pairs = [
+ (os.path.join(root, l + "_1.jpg"), os.path.join(root, l + "_2.jpg"))
+ for l in lines
+ if not l.startswith("#")
+ ]
+ return pairs
+
+
+def write_cache_file(fname, pairs, root=""):
+ if len(root) > 0:
+ if not root.endswith("/"):
+ root += "/"
+ assert os.path.isdir(root)
+ s = ""
+ for im1, im2 in pairs:
+ if len(root) > 0:
+ assert im1.startswith(root), im1
+ assert im2.startswith(root), im2
+ s += "{:s} {:s}\n".format(im1[len(root) :], im2[len(root) :])
+ with open(fname, "w") as fid:
+ fid.write(s[:-1])
+
+
+def parse_and_cache_all_pairs(dname, data_dir="./data/"):
+ if dname == "habitat_release":
+ dirname = os.path.join(data_dir, "habitat_release")
+ assert os.path.isdir(dirname), (
+ "cannot find folder for habitat_release pairs: " + dirname
+ )
+ cache_file = os.path.join(dirname, "pairs.txt")
+ assert not os.path.isfile(cache_file), (
+ "cache file already exists: " + cache_file
+ )
+
+ print("Parsing pairs for dataset: " + dname)
+ pairs = []
+ for root, dirs, files in os.walk(dirname):
+ if "val" in root:
+ continue
+ dirs.sort()
+ pairs += [
+ (
+ os.path.join(root, f),
+ os.path.join(root, f[: -len("_1.jpeg")] + "_2.jpeg"),
+ )
+ for f in sorted(files)
+ if f.endswith("_1.jpeg")
+ ]
+ print("Found {:,} pairs".format(len(pairs)))
+ print("Writing cache to: " + cache_file)
+ write_cache_file(cache_file, pairs, root=dirname)
+
+ else:
+ raise NotImplementedError("Unknown dataset: " + dname)
+
+
+def dnames_to_image_pairs(dnames, data_dir="./data/"):
+ """
+ dnames: string of dataset names with image pairs, separated by "+"
+ """
+ all_pairs = []
+ for dname in dnames.split("+"):
+ if dname == "habitat_release":
+ dirname = os.path.join(data_dir, "habitat_release")
+ assert os.path.isdir(dirname), (
+ "cannot find folder for habitat_release pairs: " + dirname
+ )
+ cache_file = os.path.join(dirname, "pairs.txt")
+ assert os.path.isfile(cache_file), (
+ "cannot find cache file for habitat_release pairs, please first create the cache file, see instructions. "
+ + cache_file
+ )
+ pairs = load_pairs_from_cache_file(cache_file, root=dirname)
+ elif dname in ["ARKitScenes", "MegaDepth", "3DStreetView", "IndoorVL"]:
+ dirname = os.path.join(data_dir, dname + "_crops")
+ assert os.path.isdir(
+ dirname
+ ), "cannot find folder for {:s} pairs: {:s}".format(dname, dirname)
+ list_file = os.path.join(dirname, "listing.txt")
+ assert os.path.isfile(
+ list_file
+ ), "cannot find list file for {:s} pairs, see instructions. {:s}".format(
+ dname, list_file
+ )
+ pairs = load_pairs_from_list_file(list_file, root=dirname)
+ print(" {:s}: {:,} pairs".format(dname, len(pairs)))
+ all_pairs += pairs
+ if "+" in dnames:
+ print(" Total: {:,} pairs".format(len(all_pairs)))
+ return all_pairs
+
+
+class PairsDataset(Dataset):
+ def __init__(
+ self, dnames, trfs="", totensor=True, normalize=True, data_dir="./data/"
+ ):
+ super().__init__()
+ self.image_pairs = dnames_to_image_pairs(dnames, data_dir=data_dir)
+ self.transforms = get_pair_transforms(
+ transform_str=trfs, totensor=totensor, normalize=normalize
+ )
+
+ def __len__(self):
+ return len(self.image_pairs)
+
+ def __getitem__(self, index):
+ im1path, im2path = self.image_pairs[index]
+ im1 = load_image(im1path)
+ im2 = load_image(im2path)
+ if self.transforms is not None:
+ im1, im2 = self.transforms(im1, im2)
+ return im1, im2
+
+
+if __name__ == "__main__":
+ import argparse
+
+ parser = argparse.ArgumentParser(
+ description="Compute and cache the list of pairs for a given dataset"
+ )
+ parser.add_argument(
+ "--data_dir", default="./data/", type=str, help="path where data are stored"
+ )
+ parser.add_argument(
+ "--dataset", default="habitat_release", type=str, help="name of the dataset"
+ )
+ args = parser.parse_args()
+ parse_and_cache_all_pairs(dname=args.dataset, data_dir=args.data_dir)
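Review note on the cache file format: `write_cache_file` stores one pair per line as two root-relative paths separated by a space, and `load_pairs_from_cache_file` re-joins them with the root. A minimal round-trip sketch under that assumption (`write_pairs` and `read_pairs` are hypothetical stand-ins, not the functions from the patch):

```python
import os
import tempfile


def write_pairs(fname, pairs, root):
    """Write pairs as 'rel_path_1 rel_path_2' lines, relative to root."""
    prefix = root if root.endswith(os.sep) else root + os.sep
    lines = [f"{im1[len(prefix):]} {im2[len(prefix):]}" for im1, im2 in pairs]
    with open(fname, "w") as fid:
        fid.write("\n".join(lines))


def read_pairs(fname, root):
    """Read the lines back and re-attach root to both paths."""
    with open(fname, "r") as fid:
        rows = [line.split() for line in fid.read().strip().splitlines()]
    return [(os.path.join(root, a), os.path.join(root, b)) for a, b in rows]


with tempfile.TemporaryDirectory() as demo_root:
    demo_pairs = [
        (
            os.path.join(demo_root, "scene", "0_1.jpeg"),
            os.path.join(demo_root, "scene", "0_2.jpeg"),
        )
    ]
    cache = os.path.join(demo_root, "pairs.txt")
    write_pairs(cache, demo_pairs, demo_root)
    restored = read_pairs(cache, demo_root)
```

Because lines are split on whitespace, this format cannot represent paths that contain spaces.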
diff --git a/third_party/dust3r/croco/datasets/transforms.py b/third_party/dust3r/croco/datasets/transforms.py
new file mode 100644
index 0000000000000000000000000000000000000000..30d8adccbdb5bdd083bab88211494c1682f88352
--- /dev/null
+++ b/third_party/dust3r/croco/datasets/transforms.py
@@ -0,0 +1,138 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+import torch
+import torchvision.transforms
+import torchvision.transforms.functional as F
+
+# "Pair": apply a transform on a pair
+# "Both": apply the exact same transform to both images
+
+
+class ComposePair(torchvision.transforms.Compose):
+ def __call__(self, img1, img2):
+ for t in self.transforms:
+ img1, img2 = t(img1, img2)
+ return img1, img2
+
+
+class NormalizeBoth(torchvision.transforms.Normalize):
+ def forward(self, img1, img2):
+ img1 = super().forward(img1)
+ img2 = super().forward(img2)
+ return img1, img2
+
+
+class ToTensorBoth(torchvision.transforms.ToTensor):
+ def __call__(self, img1, img2):
+ img1 = super().__call__(img1)
+ img2 = super().__call__(img2)
+ return img1, img2
+
+
+class RandomCropPair(torchvision.transforms.RandomCrop):
+ # the crop will be intentionally different for the two images with this class
+ def forward(self, img1, img2):
+ img1 = super().forward(img1)
+ img2 = super().forward(img2)
+ return img1, img2
+
+
+class ColorJitterPair(torchvision.transforms.ColorJitter):
+ # can be symmetric (same for both images) or asymmetric (different jitter params for each image) depending on assymetric_prob
+ def __init__(self, assymetric_prob, **kwargs):
+ super().__init__(**kwargs)
+ self.assymetric_prob = assymetric_prob
+
+ def jitter_one(
+ self,
+ img,
+ fn_idx,
+ brightness_factor,
+ contrast_factor,
+ saturation_factor,
+ hue_factor,
+ ):
+ for fn_id in fn_idx:
+ if fn_id == 0 and brightness_factor is not None:
+ img = F.adjust_brightness(img, brightness_factor)
+ elif fn_id == 1 and contrast_factor is not None:
+ img = F.adjust_contrast(img, contrast_factor)
+ elif fn_id == 2 and saturation_factor is not None:
+ img = F.adjust_saturation(img, saturation_factor)
+ elif fn_id == 3 and hue_factor is not None:
+ img = F.adjust_hue(img, hue_factor)
+ return img
+
+ def forward(self, img1, img2):
+ (
+ fn_idx,
+ brightness_factor,
+ contrast_factor,
+ saturation_factor,
+ hue_factor,
+ ) = self.get_params(self.brightness, self.contrast, self.saturation, self.hue)
+ img1 = self.jitter_one(
+ img1,
+ fn_idx,
+ brightness_factor,
+ contrast_factor,
+ saturation_factor,
+ hue_factor,
+ )
+ if torch.rand(1) < self.assymetric_prob: # asymmetric: resample jitter params for img2
+ (
+ fn_idx,
+ brightness_factor,
+ contrast_factor,
+ saturation_factor,
+ hue_factor,
+ ) = self.get_params(
+ self.brightness, self.contrast, self.saturation, self.hue
+ )
+ img2 = self.jitter_one(
+ img2,
+ fn_idx,
+ brightness_factor,
+ contrast_factor,
+ saturation_factor,
+ hue_factor,
+ )
+ return img1, img2
+
+
+def get_pair_transforms(transform_str, totensor=True, normalize=True):
+ # transform_str is e.g. "crop224+acolor"
+ trfs = []
+ for s in transform_str.split("+"):
+ if s.startswith("crop"):
+ size = int(s[len("crop") :])
+ trfs.append(RandomCropPair(size))
+ elif s == "acolor":
+ trfs.append(
+ ColorJitterPair(
+ assymetric_prob=1.0,
+ brightness=(0.6, 1.4),
+ contrast=(0.6, 1.4),
+ saturation=(0.6, 1.4),
+ hue=0.0,
+ )
+ )
+ elif s == "": # if transform_str was ""
+ pass
+ else:
+ raise NotImplementedError("Unknown augmentation: " + s)
+
+ if totensor:
+ trfs.append(ToTensorBoth())
+ if normalize:
+ trfs.append(
+ NormalizeBoth(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
+ )
+
+ if len(trfs) == 0:
+ return None
+ elif len(trfs) == 1:
+ return trfs[0]
+ else:
+ return ComposePair(trfs)
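Review note on `get_pair_transforms`: the transform string is a `+`-separated list in which `cropN` selects a random crop of size `N` and `acolor` selects asymmetric color jitter. The parsing step alone can be sketched without torchvision; `parse_transform_str` is a hypothetical helper that returns plain tuples instead of transform objects:

```python
def parse_transform_str(transform_str):
    """Split a string such as 'crop224+acolor' into (name, argument) steps.

    Hypothetical helper; the patch builds torchvision transforms instead.
    """
    steps = []
    for token in transform_str.split("+"):
        if token.startswith("crop"):
            # The crop size follows the 'crop' prefix directly
            steps.append(("crop", int(token[len("crop"):])))
        elif token == "acolor":
            steps.append(("acolor", None))
        elif token == "":
            # An empty string means no augmentation
            continue
        else:
            raise NotImplementedError("Unknown augmentation: " + token)
    return steps
```

As in the patch, any unrecognized token (for example `color` without the leading `a`) is rejected rather than silently ignored.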
diff --git a/third_party/dust3r/croco/demo.py b/third_party/dust3r/croco/demo.py
new file mode 100644
index 0000000000000000000000000000000000000000..8064477fb3a81e1861e77c31b5b1ec94e0ca53d2
--- /dev/null
+++ b/third_party/dust3r/croco/demo.py
@@ -0,0 +1,78 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+import torch
+import torchvision.transforms
+from models.croco import CroCoNet
+from PIL import Image
+from torchvision.transforms import Compose, Normalize, ToTensor
+
+
+def main():
+ device = torch.device(
+ "cuda:0"
+ if torch.cuda.is_available() and torch.cuda.device_count() > 0
+ else "cpu"
+ )
+
+ # load 224x224 images and transform them to tensor
+ imagenet_mean = [0.485, 0.456, 0.406]
+ imagenet_mean_tensor = (
+ torch.tensor(imagenet_mean).view(1, 3, 1, 1).to(device, non_blocking=True)
+ )
+ imagenet_std = [0.229, 0.224, 0.225]
+ imagenet_std_tensor = (
+ torch.tensor(imagenet_std).view(1, 3, 1, 1).to(device, non_blocking=True)
+ )
+ trfs = Compose([ToTensor(), Normalize(mean=imagenet_mean, std=imagenet_std)])
+ image1 = (
+ trfs(Image.open("assets/Chateau1.png").convert("RGB"))
+ .to(device, non_blocking=True)
+ .unsqueeze(0)
+ )
+ image2 = (
+ trfs(Image.open("assets/Chateau2.png").convert("RGB"))
+ .to(device, non_blocking=True)
+ .unsqueeze(0)
+ )
+
+ # load model
+ ckpt = torch.load("pretrained_models/CroCo_V2_ViTLarge_BaseDecoder.pth", "cpu")
+ model = CroCoNet(**ckpt.get("croco_kwargs", {})).to(device)
+ model.eval()
+ msg = model.load_state_dict(ckpt["model"], strict=True)
+
+ # forward
+ with torch.inference_mode():
+ out, mask, target = model(image1, image2)
+
+ # the output is normalized, thus use the mean/std of the actual image to go back to RGB space
+ patchified = model.patchify(image1)
+ mean = patchified.mean(dim=-1, keepdim=True)
+ var = patchified.var(dim=-1, keepdim=True)
+ decoded_image = model.unpatchify(out * (var + 1.0e-6) ** 0.5 + mean)
+ # undo imagenet normalization, prepare masked image
+ decoded_image = decoded_image * imagenet_std_tensor + imagenet_mean_tensor
+ input_image = image1 * imagenet_std_tensor + imagenet_mean_tensor
+ ref_image = image2 * imagenet_std_tensor + imagenet_mean_tensor
+ image_masks = model.unpatchify(
+ model.patchify(torch.ones_like(ref_image)) * mask[:, :, None]
+ )
+ masked_input_image = (1 - image_masks) * input_image
+
+ # make visualization
+ visualization = torch.cat(
+ (ref_image, masked_input_image, decoded_image, input_image), dim=3
+ ) # 4*(B, 3, H, W) -> B, 3, H, W*4
+ B, C, H, W = visualization.shape
+ visualization = visualization.permute(1, 0, 2, 3).reshape(C, B * H, W)
+ visualization = torchvision.transforms.functional.to_pil_image(
+ torch.clamp(visualization, 0, 1)
+ )
+ fname = "demo_output.png"
+ visualization.save(fname)
+ print("Visualization saved in " + fname)
+
+
+if __name__ == "__main__":
+ main()
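Review note on the de-normalization in `demo.py`: the expression `out * (var + 1.0e-6) ** 0.5 + mean` maps each predicted patch back to pixel space using the mean and variance of the corresponding real patch. The sketch below illustrates this on a single patch in pure Python, assuming the training target was normalized per patch as the exact inverse of that expression (an assumption, since the loss code is not part of this hunk). `statistics.variance` is the unbiased estimator, which matches the default of `torch.var`.

```python
import statistics

EPS = 1.0e-6


def normalize_patch(pixels):
    """Assumed forward step: zero mean, unit variance within one patch."""
    mean = statistics.fmean(pixels)
    var = statistics.variance(pixels)  # unbiased, like torch.var by default
    return [(p - mean) / (var + EPS) ** 0.5 for p in pixels]


def denormalize_patch(normalized, reference_pixels):
    """Inverse step, as in demo.py: rescale with the reference patch statistics."""
    mean = statistics.fmean(reference_pixels)
    var = statistics.variance(reference_pixels)
    return [n * (var + EPS) ** 0.5 + mean for n in normalized]
```

This is why the demo needs the actual image to produce a viewable reconstruction: the network output alone does not carry the per-patch mean and contrast.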
diff --git a/third_party/dust3r/croco/interactive_demo.ipynb b/third_party/dust3r/croco/interactive_demo.ipynb
new file mode 100644
index 0000000000000000000000000000000000000000..6cfc960af5baac9a69029c29a16eea4e24123a71
--- /dev/null
+++ b/third_party/dust3r/croco/interactive_demo.ipynb
@@ -0,0 +1,271 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Interactive demo of Cross-view Completion."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Copyright (C) 2022-present Naver Corporation. All rights reserved.\n",
+ "# Licensed under CC BY-NC-SA 4.0 (non-commercial use only)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import torch\n",
+ "import numpy as np\n",
+ "from models.croco import CroCoNet\n",
+ "from ipywidgets import interact, interactive, fixed, interact_manual\n",
+ "import ipywidgets as widgets\n",
+ "import matplotlib.pyplot as plt\n",
+ "import quaternion\n",
+ "import models.masking"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Load CroCo model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ckpt = torch.load('pretrained_models/CroCo_V2_ViTLarge_BaseDecoder.pth', 'cpu')\n",
+ "model = CroCoNet( **ckpt.get('croco_kwargs',{}))\n",
+ "msg = model.load_state_dict(ckpt['model'], strict=True)\n",
+ "use_gpu = torch.cuda.is_available() and torch.cuda.device_count()>0\n",
+ "device = torch.device('cuda:0' if use_gpu else 'cpu')\n",
+ "model = model.eval()\n",
+ "model = model.to(device=device)\n",
+ "print(msg)\n",
+ "\n",
+ "def process_images(ref_image, target_image, masking_ratio, reconstruct_unmasked_patches=False):\n",
+ " \"\"\"\n",
+ " Perform Cross-View completion using two input images, specified using Numpy arrays.\n",
+ " \"\"\"\n",
+ " # Replace the mask generator\n",
+ " model.mask_generator = models.masking.RandomMask(model.patch_embed.num_patches, masking_ratio)\n",
+ "\n",
+ " # ImageNet-1k color normalization\n",
+ " imagenet_mean = torch.as_tensor([0.485, 0.456, 0.406]).reshape(1,3,1,1).to(device)\n",
+ " imagenet_std = torch.as_tensor([0.229, 0.224, 0.225]).reshape(1,3,1,1).to(device)\n",
+ "\n",
+ " normalize_input_colors = True\n",
+ " is_output_normalized = True\n",
+ " with torch.no_grad():\n",
+ " # Cast data to torch\n",
+ " target_image = (torch.as_tensor(target_image, dtype=torch.float, device=device).permute(2,0,1) / 255)[None]\n",
+ " ref_image = (torch.as_tensor(ref_image, dtype=torch.float, device=device).permute(2,0,1) / 255)[None]\n",
+ "\n",
+ " if normalize_input_colors:\n",
+ " ref_image = (ref_image - imagenet_mean) / imagenet_std\n",
+ " target_image = (target_image - imagenet_mean) / imagenet_std\n",
+ "\n",
+ " out, mask, _ = model(target_image, ref_image)\n",
+ " # # get target\n",
+ " if not is_output_normalized:\n",
+ " predicted_image = model.unpatchify(out)\n",
+ " else:\n",
+ " # The output only contains higher order information,\n",
+ " # we retrieve mean and standard deviation from the actual target image\n",
+ " patchified = model.patchify(target_image)\n",
+ " mean = patchified.mean(dim=-1, keepdim=True)\n",
+ " var = patchified.var(dim=-1, keepdim=True)\n",
+ " pred_renorm = out * (var + 1.e-6)**.5 + mean\n",
+ " predicted_image = model.unpatchify(pred_renorm)\n",
+ "\n",
+ " image_masks = model.unpatchify(model.patchify(torch.ones_like(ref_image)) * mask[:,:,None])\n",
+ " masked_target_image = (1 - image_masks) * target_image\n",
+ " \n",
+ " if not reconstruct_unmasked_patches:\n",
+ " # Replace unmasked patches by their actual values\n",
+ " predicted_image = predicted_image * image_masks + masked_target_image\n",
+ "\n",
+ " # Unapply color normalization\n",
+ " if normalize_input_colors:\n",
+ " predicted_image = predicted_image * imagenet_std + imagenet_mean\n",
+ " masked_target_image = masked_target_image * imagenet_std + imagenet_mean\n",
+ " \n",
+ " # Cast to Numpy\n",
+ " masked_target_image = np.asarray(torch.clamp(masked_target_image.squeeze(0).permute(1,2,0) * 255, 0, 255).cpu().numpy(), dtype=np.uint8)\n",
+ " predicted_image = np.asarray(torch.clamp(predicted_image.squeeze(0).permute(1,2,0) * 255, 0, 255).cpu().numpy(), dtype=np.uint8)\n",
+ " return masked_target_image, predicted_image"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Use the Habitat simulator to render images from arbitrary viewpoints (requires habitat_sim to be installed)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "os.environ[\"MAGNUM_LOG\"]=\"quiet\"\n",
+ "os.environ[\"HABITAT_SIM_LOG\"]=\"quiet\"\n",
+ "import habitat_sim\n",
+ "\n",
+ "scene = \"habitat-sim-data/scene_datasets/habitat-test-scenes/skokloster-castle.glb\"\n",
+ "navmesh = \"habitat-sim-data/scene_datasets/habitat-test-scenes/skokloster-castle.navmesh\"\n",
+ "\n",
+ "sim_cfg = habitat_sim.SimulatorConfiguration()\n",
+ "if use_gpu: sim_cfg.gpu_device_id = 0\n",
+ "sim_cfg.scene_id = scene\n",
+ "sim_cfg.load_semantic_mesh = False\n",
+ "rgb_sensor_spec = habitat_sim.CameraSensorSpec()\n",
+ "rgb_sensor_spec.uuid = \"color\"\n",
+ "rgb_sensor_spec.sensor_type = habitat_sim.SensorType.COLOR\n",
+ "rgb_sensor_spec.resolution = (224,224)\n",
+ "rgb_sensor_spec.hfov = 56.56\n",
+ "rgb_sensor_spec.position = [0.0, 0.0, 0.0]\n",
+ "rgb_sensor_spec.orientation = [0, 0, 0]\n",
+ "agent_cfg = habitat_sim.agent.AgentConfiguration(sensor_specifications=[rgb_sensor_spec])\n",
+ "\n",
+ "\n",
+ "cfg = habitat_sim.Configuration(sim_cfg, [agent_cfg])\n",
+ "sim = habitat_sim.Simulator(cfg)\n",
+ "if navmesh is not None:\n",
+ " sim.pathfinder.load_nav_mesh(navmesh)\n",
+ "agent = sim.initialize_agent(agent_id=0)\n",
+ "\n",
+ "def sample_random_viewpoint():\n",
+ " \"\"\" Sample a random viewpoint using the navmesh \"\"\"\n",
+ " nav_point = sim.pathfinder.get_random_navigable_point()\n",
+ " # Sample a random viewpoint height\n",
+ " viewpoint_height = np.random.uniform(1.0, 1.6)\n",
+ " viewpoint_position = nav_point + viewpoint_height * habitat_sim.geo.UP\n",
+ " viewpoint_orientation = quaternion.from_rotation_vector(np.random.uniform(-np.pi, np.pi) * habitat_sim.geo.UP)\n",
+ " return viewpoint_position, viewpoint_orientation\n",
+ "\n",
+ "def render_viewpoint(position, orientation):\n",
+ " agent_state = habitat_sim.AgentState()\n",
+ " agent_state.position = position\n",
+ " agent_state.rotation = orientation\n",
+ " agent.set_state(agent_state)\n",
+ " viewpoint_observations = sim.get_sensor_observations(agent_ids=0)\n",
+ " image = viewpoint_observations['color'][:,:,:3]\n",
+ " image = np.asarray(np.clip(1.5 * np.asarray(image, dtype=float), 0, 255), dtype=np.uint8)\n",
+ " return image"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Sample a random reference view"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ref_position, ref_orientation = sample_random_viewpoint()\n",
+ "ref_image = render_viewpoint(ref_position, ref_orientation)\n",
+ "plt.clf()\n",
+ "fig, axes = plt.subplots(1,1, squeeze=False, num=1)\n",
+ "axes[0,0].imshow(ref_image)\n",
+ "for ax in axes.flatten():\n",
+ " ax.set_xticks([])\n",
+ " ax.set_yticks([])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Interactive cross-view completion using CroCo"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "reconstruct_unmasked_patches = False\n",
+ "\n",
+ "def show_demo(masking_ratio, x, y, z, panorama, elevation):\n",
+ " R = quaternion.as_rotation_matrix(ref_orientation)\n",
+ " target_position = ref_position + x * R[:,0] + y * R[:,1] + z * R[:,2]\n",
+ " target_orientation = (ref_orientation\n",
+ " * quaternion.from_rotation_vector(-elevation * np.pi/180 * habitat_sim.geo.LEFT) \n",
+ " * quaternion.from_rotation_vector(-panorama * np.pi/180 * habitat_sim.geo.UP))\n",
+ " \n",
+ " ref_image = render_viewpoint(ref_position, ref_orientation)\n",
+ " target_image = render_viewpoint(target_position, target_orientation)\n",
+ "\n",
+ " masked_target_image, predicted_image = process_images(ref_image, target_image, masking_ratio, reconstruct_unmasked_patches)\n",
+ "\n",
+ " fig, axes = plt.subplots(1,4, squeeze=True, dpi=300)\n",
+ " axes[0].imshow(ref_image)\n",
+ " axes[0].set_xlabel(\"Reference\")\n",
+ " axes[1].imshow(masked_target_image)\n",
+ " axes[1].set_xlabel(\"Masked target\")\n",
+ " axes[2].imshow(predicted_image)\n",
+ " axes[2].set_xlabel(\"Reconstruction\") \n",
+ " axes[3].imshow(target_image)\n",
+ " axes[3].set_xlabel(\"Target\")\n",
+ " for ax in axes.flatten():\n",
+ " ax.set_xticks([])\n",
+ " ax.set_yticks([])\n",
+ "\n",
+ "interact(show_demo,\n",
+ " masking_ratio=widgets.FloatSlider(description='masking', value=0.9, min=0.0, max=1.0),\n",
+ " x=widgets.FloatSlider(value=0.0, min=-0.5, max=0.5, step=0.05),\n",
+ " y=widgets.FloatSlider(value=0.0, min=-0.5, max=0.5, step=0.05),\n",
+ " z=widgets.FloatSlider(value=0.0, min=-0.5, max=0.5, step=0.05),\n",
+ " panorama=widgets.FloatSlider(value=0.0, min=-20, max=20, step=0.5),\n",
+ " elevation=widgets.FloatSlider(value=0.0, min=-20, max=20, step=0.5));"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.13"
+ },
+ "vscode": {
+ "interpreter": {
+ "hash": "f9237820cd248d7e07cb4fb9f0e4508a85d642f19d831560c0a4b61f3e907e67"
+ }
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/third_party/dust3r/croco/models/blocks.py b/third_party/dust3r/croco/models/blocks.py
new file mode 100644
index 0000000000000000000000000000000000000000..f7e6a4157a5db71585d0d86718f1e5f4487828ad
--- /dev/null
+++ b/third_party/dust3r/croco/models/blocks.py
@@ -0,0 +1,349 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+
+# --------------------------------------------------------
+# Main encoder/decoder blocks
+# --------------------------------------------------------
+# References:
+# timm
+# https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py
+# https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/layers/helpers.py
+# https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/layers/drop.py
+# https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/layers/mlp.py
+# https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/layers/patch_embed.py
+
+
+import collections.abc
+from itertools import repeat
+
+import torch
+import torch.nn as nn
+
+
+def _ntuple(n):
+ def parse(x):
+ if isinstance(x, collections.abc.Iterable) and not isinstance(x, str):
+ return x
+ return tuple(repeat(x, n))
+
+ return parse
+
+
+to_2tuple = _ntuple(2)
+
+
+def drop_path(
+ x, drop_prob: float = 0.0, training: bool = False, scale_by_keep: bool = True
+):
+ """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks)."""
+ if drop_prob == 0.0 or not training:
+ return x
+ keep_prob = 1 - drop_prob
+ shape = (x.shape[0],) + (1,) * (
+ x.ndim - 1
+ ) # work with diff dim tensors, not just 2D ConvNets
+ random_tensor = x.new_empty(shape).bernoulli_(keep_prob)
+ if keep_prob > 0.0 and scale_by_keep:
+ random_tensor.div_(keep_prob)
+ return x * random_tensor
+
+
+class DropPath(nn.Module):
+ """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks)."""
+
+ def __init__(self, drop_prob: float = 0.0, scale_by_keep: bool = True):
+ super(DropPath, self).__init__()
+ self.drop_prob = drop_prob
+ self.scale_by_keep = scale_by_keep
+
+ def forward(self, x):
+ return drop_path(x, self.drop_prob, self.training, self.scale_by_keep)
+
+ def extra_repr(self):
+ return f"drop_prob={round(self.drop_prob,3):0.3f}"
+
+
+class Mlp(nn.Module):
+ """MLP as used in Vision Transformer, MLP-Mixer and related networks"""
+
+ def __init__(
+ self,
+ in_features,
+ hidden_features=None,
+ out_features=None,
+ act_layer=nn.GELU,
+ bias=True,
+ drop=0.0,
+ ):
+ super().__init__()
+ out_features = out_features or in_features
+ hidden_features = hidden_features or in_features
+ bias = to_2tuple(bias)
+ drop_probs = to_2tuple(drop)
+
+ self.fc1 = nn.Linear(in_features, hidden_features, bias=bias[0])
+ self.act = act_layer()
+ self.drop1 = nn.Dropout(drop_probs[0])
+ self.fc2 = nn.Linear(hidden_features, out_features, bias=bias[1])
+ self.drop2 = nn.Dropout(drop_probs[1])
+
+ def forward(self, x):
+ x = self.fc1(x)
+ x = self.act(x)
+ x = self.drop1(x)
+ x = self.fc2(x)
+ x = self.drop2(x)
+ return x
+
+
+class Attention(nn.Module):
+ def __init__(
+ self, dim, rope=None, num_heads=8, qkv_bias=False, attn_drop=0.0, proj_drop=0.0
+ ):
+ super().__init__()
+ self.num_heads = num_heads
+ head_dim = dim // num_heads
+ self.scale = head_dim**-0.5
+ self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
+ self.attn_drop = nn.Dropout(attn_drop)
+ self.proj = nn.Linear(dim, dim)
+ self.proj_drop = nn.Dropout(proj_drop)
+ self.rope = rope
+
+ def forward(self, x, xpos):
+ B, N, C = x.shape
+
+ qkv = (
+ self.qkv(x)
+ .reshape(B, N, 3, self.num_heads, C // self.num_heads)
+ .transpose(1, 3)
+ )
+ q, k, v = [qkv[:, :, i] for i in range(3)]
+ # q,k,v = qkv.unbind(2) # make torchscript happy (cannot use tensor as tuple)
+
+ if self.rope is not None:
+ q = self.rope(q, xpos)
+ k = self.rope(k, xpos)
+
+ attn = (q @ k.transpose(-2, -1)) * self.scale
+ attn = attn.softmax(dim=-1)
+ attn = self.attn_drop(attn)
+
+ x = (attn @ v).transpose(1, 2).reshape(B, N, C)
+ x = self.proj(x)
+ x = self.proj_drop(x)
+ return x
+
+
+class Block(nn.Module):
+ def __init__(
+ self,
+ dim,
+ num_heads,
+ mlp_ratio=4.0,
+ qkv_bias=False,
+ drop=0.0,
+ attn_drop=0.0,
+ drop_path=0.0,
+ act_layer=nn.GELU,
+ norm_layer=nn.LayerNorm,
+ rope=None,
+ ):
+ super().__init__()
+ self.norm1 = norm_layer(dim)
+ self.attn = Attention(
+ dim,
+ rope=rope,
+ num_heads=num_heads,
+ qkv_bias=qkv_bias,
+ attn_drop=attn_drop,
+ proj_drop=drop,
+ )
+ # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
+ self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
+ self.norm2 = norm_layer(dim)
+ mlp_hidden_dim = int(dim * mlp_ratio)
+ self.mlp = Mlp(
+ in_features=dim,
+ hidden_features=mlp_hidden_dim,
+ act_layer=act_layer,
+ drop=drop,
+ )
+
+ def forward(self, x, xpos):
+ x = x + self.drop_path(self.attn(self.norm1(x), xpos))
+ x = x + self.drop_path(self.mlp(self.norm2(x)))
+ return x
+
+
+class CrossAttention(nn.Module):
+ def __init__(
+ self, dim, rope=None, num_heads=8, qkv_bias=False, attn_drop=0.0, proj_drop=0.0
+ ):
+ super().__init__()
+ self.num_heads = num_heads
+ head_dim = dim // num_heads
+ self.scale = head_dim**-0.5
+
+ self.projq = nn.Linear(dim, dim, bias=qkv_bias)
+ self.projk = nn.Linear(dim, dim, bias=qkv_bias)
+ self.projv = nn.Linear(dim, dim, bias=qkv_bias)
+ self.attn_drop = nn.Dropout(attn_drop)
+ self.proj = nn.Linear(dim, dim)
+ self.proj_drop = nn.Dropout(proj_drop)
+
+ self.rope = rope
+
+ def forward(self, query, key, value, qpos, kpos):
+ B, Nq, C = query.shape
+ Nk = key.shape[1]
+ Nv = value.shape[1]
+
+ q = (
+ self.projq(query)
+ .reshape(B, Nq, self.num_heads, C // self.num_heads)
+ .permute(0, 2, 1, 3)
+ )
+ k = (
+ self.projk(key)
+ .reshape(B, Nk, self.num_heads, C // self.num_heads)
+ .permute(0, 2, 1, 3)
+ )
+ v = (
+ self.projv(value)
+ .reshape(B, Nv, self.num_heads, C // self.num_heads)
+ .permute(0, 2, 1, 3)
+ )
+
+ if self.rope is not None:
+ q = self.rope(q, qpos)
+ k = self.rope(k, kpos)
+
+ attn = (q @ k.transpose(-2, -1)) * self.scale
+ attn = attn.softmax(dim=-1)
+ attn = self.attn_drop(attn)
+
+ x = (attn @ v).transpose(1, 2).reshape(B, Nq, C)
+ x = self.proj(x)
+ x = self.proj_drop(x)
+ return x
+
+
+class DecoderBlock(nn.Module):
+ def __init__(
+ self,
+ dim,
+ num_heads,
+ mlp_ratio=4.0,
+ qkv_bias=False,
+ drop=0.0,
+ attn_drop=0.0,
+ drop_path=0.0,
+ act_layer=nn.GELU,
+ norm_layer=nn.LayerNorm,
+ norm_mem=True,
+ rope=None,
+ ):
+ super().__init__()
+ self.norm1 = norm_layer(dim)
+ self.attn = Attention(
+ dim,
+ rope=rope,
+ num_heads=num_heads,
+ qkv_bias=qkv_bias,
+ attn_drop=attn_drop,
+ proj_drop=drop,
+ )
+ self.cross_attn = CrossAttention(
+ dim,
+ rope=rope,
+ num_heads=num_heads,
+ qkv_bias=qkv_bias,
+ attn_drop=attn_drop,
+ proj_drop=drop,
+ )
+ self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
+ self.norm2 = norm_layer(dim)
+ self.norm3 = norm_layer(dim)
+ mlp_hidden_dim = int(dim * mlp_ratio)
+ self.mlp = Mlp(
+ in_features=dim,
+ hidden_features=mlp_hidden_dim,
+ act_layer=act_layer,
+ drop=drop,
+ )
+ self.norm_y = norm_layer(dim) if norm_mem else nn.Identity()
+
+ def forward(self, x, y, xpos, ypos):
+ x = x + self.drop_path(self.attn(self.norm1(x), xpos))
+ y_ = self.norm_y(y)
+ x = x + self.drop_path(self.cross_attn(self.norm2(x), y_, y_, xpos, ypos))
+ x = x + self.drop_path(self.mlp(self.norm3(x)))
+ return x, y
+
+
+# patch embedding
+class PositionGetter(object):
+ """return positions of patches"""
+
+ def __init__(self):
+ self.cache_positions = {}
+
+ def __call__(self, b, h, w, device):
+ if (h, w) not in self.cache_positions:
+ x = torch.arange(w, device=device)
+ y = torch.arange(h, device=device)
+ self.cache_positions[h, w] = torch.cartesian_prod(y, x) # (h*w, 2), row-major (y, x) pairs
+ pos = self.cache_positions[h, w].view(1, h * w, 2).expand(b, -1, 2).clone()
+ return pos
+
+
+class PatchEmbed(nn.Module):
+ """just adding _init_weights + position getter compared to timm.models.layers.patch_embed.PatchEmbed"""
+
+ def __init__(
+ self,
+ img_size=224,
+ patch_size=16,
+ in_chans=3,
+ embed_dim=768,
+ norm_layer=None,
+ flatten=True,
+ ):
+ super().__init__()
+ img_size = to_2tuple(img_size)
+ patch_size = to_2tuple(patch_size)
+ self.img_size = img_size
+ self.patch_size = patch_size
+ self.grid_size = (img_size[0] // patch_size[0], img_size[1] // patch_size[1])
+ self.num_patches = self.grid_size[0] * self.grid_size[1]
+ self.flatten = flatten
+
+ self.proj = nn.Conv2d(
+ in_chans, embed_dim, kernel_size=patch_size, stride=patch_size
+ )
+ self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()
+
+ self.position_getter = PositionGetter()
+
+ def forward(self, x):
+ B, C, H, W = x.shape
+ torch._assert(
+ H == self.img_size[0],
+ f"Input image height ({H}) doesn't match model ({self.img_size[0]}).",
+ )
+ torch._assert(
+ W == self.img_size[1],
+ f"Input image width ({W}) doesn't match model ({self.img_size[1]}).",
+ )
+ x = self.proj(x)
+ pos = self.position_getter(B, x.size(2), x.size(3), x.device)
+ if self.flatten:
+ x = x.flatten(2).transpose(1, 2) # BCHW -> BNC
+ x = self.norm(x)
+ return x, pos
+
+ def _init_weights(self):
+ w = self.proj.weight.data
+ torch.nn.init.xavier_uniform_(w.view([w.shape[0], -1]))
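
Reviewer note on `PositionGetter` in `blocks.py`: `torch.cartesian_prod(y, x)` enumerates `(y, x)` pairs in row-major order, so the cached tensor has shape `(h * w, 2)` before it is viewed as `(1, h * w, 2)`. The sketch below reproduces that ordering in pure Python with `itertools.product`. The helper name `patch_positions` is hypothetical and not part of this patch.

```python
from itertools import product


def patch_positions(h, w):
    """Hypothetical stand-in for PositionGetter: (y, x) index of every patch."""
    # itertools.product varies its last argument fastest, which is the same
    # ordering torch.cartesian_prod(y, x) produces: row 0 first, then row 1, ...
    return [(y, x) for y, x in product(range(h), range(w))]


if __name__ == "__main__":
    for pos in patch_positions(2, 3):
        print(pos)
```

With the default `img_size=224` and `patch_size=16`, the grid is 14 x 14, so each image yields 196 positions.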
diff --git a/third_party/dust3r/croco/models/criterion.py b/third_party/dust3r/croco/models/criterion.py
new file mode 100644
index 0000000000000000000000000000000000000000..dcd7a26af15585161c85df23a895691b3f38e91e
--- /dev/null
+++ b/third_party/dust3r/croco/models/criterion.py
@@ -0,0 +1,36 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+#
+# --------------------------------------------------------
+# Criterion to train CroCo
+# --------------------------------------------------------
+# References:
+# MAE: https://github.com/facebookresearch/mae
+# --------------------------------------------------------
+
+import torch
+
+
+class MaskedMSE(torch.nn.Module):
+ def __init__(self, norm_pix_loss=False, masked=True):
+ """
+ norm_pix_loss: normalize each target patch by its pixel mean and variance
+ masked: compute loss over the masked patches only
+ """
+ super().__init__()
+ self.norm_pix_loss = norm_pix_loss
+ self.masked = masked
+
+ def forward(self, pred, mask, target):
+ if self.norm_pix_loss:
+ mean = target.mean(dim=-1, keepdim=True)
+ var = target.var(dim=-1, keepdim=True)
+ target = (target - mean) / (var + 1.0e-6) ** 0.5
+
+ loss = (pred - target) ** 2
+ loss = loss.mean(dim=-1) # [N, L], mean loss per patch
+ if self.masked:
+ loss = (loss * mask).sum() / mask.sum() # mean loss on masked patches
+ else:
+ loss = loss.mean() # mean loss
+ return loss
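
Reviewer note on `MaskedMSE` in `criterion.py`: the loss is first averaged over the pixels of each patch, then averaged either over the masked patches only or over all patches. The sketch below mirrors that arithmetic for a single sample using plain lists. It is an illustration under simplifying assumptions (no batch dimension, no `norm_pix_loss`), and the function name `masked_mse` is hypothetical.

```python
def masked_mse(pred, target, mask, masked=True):
    """pred/target: list of patches (lists of floats); mask: 1 marks a masked patch."""
    # Mean squared error per patch (the `loss.mean(dim=-1)` step).
    per_patch = [
        sum((p - t) ** 2 for p, t in zip(pred_patch, target_patch)) / len(pred_patch)
        for pred_patch, target_patch in zip(pred, target)
    ]
    if masked:
        # Average over masked patches only; like the original, this divides by
        # the number of masked patches and fails if nothing is masked.
        return sum(loss * m for loss, m in zip(per_patch, mask)) / sum(mask)
    return sum(per_patch) / len(per_patch)


if __name__ == "__main__":
    pred = [[1.0, 1.0], [0.0, 0.0], [2.0, 4.0]]
    target = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
    mask = [1, 0, 1]
    print(masked_mse(pred, target, mask))                # masked patches only
    print(masked_mse(pred, target, mask, masked=False))  # all patches
```

The visible patch (mask value 0) contributes nothing to the masked loss, which is the behaviour selected by `masked=True`.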
diff --git a/third_party/dust3r/croco/models/croco.py b/third_party/dust3r/croco/models/croco.py
new file mode 100644
index 0000000000000000000000000000000000000000..2de6d96f2ad9daa790580df598701d692f48a71e
--- /dev/null
+++ b/third_party/dust3r/croco/models/croco.py
@@ -0,0 +1,297 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+
+# --------------------------------------------------------
+# CroCo model during pretraining
+# --------------------------------------------------------
+
+
+import torch
+import torch.nn as nn
+
+torch.backends.cuda.matmul.allow_tf32 = True # for gpu >= Ampere and pytorch >= 1.12
+from functools import partial
+
+from models.blocks import Block, DecoderBlock, PatchEmbed
+from models.masking import RandomMask
+from models.pos_embed import RoPE2D, get_2d_sincos_pos_embed
+
+
+class CroCoNet(nn.Module):
+ def __init__(
+ self,
+ img_size=224, # input image size
+ patch_size=16, # patch_size
+ mask_ratio=0.9, # ratio of masked tokens
+ enc_embed_dim=768, # encoder feature dimension
+ enc_depth=12, # encoder depth
+ enc_num_heads=12, # encoder number of heads in the transformer block
+ dec_embed_dim=512, # decoder feature dimension
+ dec_depth=8, # decoder depth
+ dec_num_heads=16, # decoder number of heads in the transformer block
+ mlp_ratio=4,
+ norm_layer=partial(nn.LayerNorm, eps=1e-6),
+ norm_im2_in_dec=True, # whether to apply normalization of the 'memory' = (second image) in the decoder
+ pos_embed="cosine", # positional embedding (either cosine or RoPE100)
+ ):
+ super(CroCoNet, self).__init__()
+
+ # patch embeddings (with initialization done as in MAE)
+ self._set_patch_embed(img_size, patch_size, enc_embed_dim)
+
+ # mask generator
+ self._set_mask_generator(self.patch_embed.num_patches, mask_ratio)
+
+ self.pos_embed = pos_embed
+ if pos_embed == "cosine":
+ # positional embedding of the encoder
+ enc_pos_embed = get_2d_sincos_pos_embed(
+ enc_embed_dim, int(self.patch_embed.num_patches**0.5), n_cls_token=0
+ )
+ self.register_buffer(
+ "enc_pos_embed", torch.from_numpy(enc_pos_embed).float()
+ )
+ # positional embedding of the decoder
+ dec_pos_embed = get_2d_sincos_pos_embed(
+ dec_embed_dim, int(self.patch_embed.num_patches**0.5), n_cls_token=0
+ )
+ self.register_buffer(
+ "dec_pos_embed", torch.from_numpy(dec_pos_embed).float()
+ )
+ # pos embedding in each block
+ self.rope = None # nothing for cosine
+ elif pos_embed.startswith("RoPE"): # eg RoPE100
+ self.enc_pos_embed = None # nothing to add in the encoder with RoPE
+ self.dec_pos_embed = None # nothing to add in the decoder with RoPE
+ if RoPE2D is None:
+ raise ImportError(
+ "Cannot find cuRoPE2D, please install it following the README instructions"
+ )
+ freq = float(pos_embed[len("RoPE") :])
+ self.rope = RoPE2D(freq=freq)
+ else:
+ raise NotImplementedError("Unknown pos_embed " + pos_embed)
+
+ # transformer for the encoder
+ self.enc_depth = enc_depth
+ self.enc_embed_dim = enc_embed_dim
+ self.enc_blocks = nn.ModuleList(
+ [
+ Block(
+ enc_embed_dim,
+ enc_num_heads,
+ mlp_ratio,
+ qkv_bias=True,
+ norm_layer=norm_layer,
+ rope=self.rope,
+ )
+ for i in range(enc_depth)
+ ]
+ )
+ self.enc_norm = norm_layer(enc_embed_dim)
+
+ # masked tokens
+ self._set_mask_token(dec_embed_dim)
+
+ # decoder
+ self._set_decoder(
+ enc_embed_dim,
+ dec_embed_dim,
+ dec_num_heads,
+ dec_depth,
+ mlp_ratio,
+ norm_layer,
+ norm_im2_in_dec,
+ )
+
+ # prediction head
+ self._set_prediction_head(dec_embed_dim, patch_size)
+
+ # initialize weights
+ self.initialize_weights()
+
+ def _set_patch_embed(self, img_size=224, patch_size=16, enc_embed_dim=768):
+ self.patch_embed = PatchEmbed(img_size, patch_size, 3, enc_embed_dim)
+
+ def _set_mask_generator(self, num_patches, mask_ratio):
+ self.mask_generator = RandomMask(num_patches, mask_ratio)
+
+ def _set_mask_token(self, dec_embed_dim):
+ self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_embed_dim))
+
+ def _set_decoder(
+ self,
+ enc_embed_dim,
+ dec_embed_dim,
+ dec_num_heads,
+ dec_depth,
+ mlp_ratio,
+ norm_layer,
+ norm_im2_in_dec,
+ ):
+ self.dec_depth = dec_depth
+ self.dec_embed_dim = dec_embed_dim
+ # transfer from encoder to decoder
+ self.decoder_embed = nn.Linear(enc_embed_dim, dec_embed_dim, bias=True)
+ # transformer for the decoder
+ self.dec_blocks = nn.ModuleList(
+ [
+ DecoderBlock(
+ dec_embed_dim,
+ dec_num_heads,
+ mlp_ratio=mlp_ratio,
+ qkv_bias=True,
+ norm_layer=norm_layer,
+ norm_mem=norm_im2_in_dec,
+ rope=self.rope,
+ )
+ for i in range(dec_depth)
+ ]
+ )
+ # final norm layer
+ self.dec_norm = norm_layer(dec_embed_dim)
+
+ def _set_prediction_head(self, dec_embed_dim, patch_size):
+ self.prediction_head = nn.Linear(dec_embed_dim, patch_size**2 * 3, bias=True)
+
+ def initialize_weights(self):
+ # patch embed
+ self.patch_embed._init_weights()
+ # mask tokens
+ if self.mask_token is not None:
+ torch.nn.init.normal_(self.mask_token, std=0.02)
+ # linears and layer norms
+ self.apply(self._init_weights)
+
+ def _init_weights(self, m):
+ if isinstance(m, nn.Linear):
+ # we use xavier_uniform following official JAX ViT:
+ torch.nn.init.xavier_uniform_(m.weight)
+ if isinstance(m, nn.Linear) and m.bias is not None:
+ nn.init.constant_(m.bias, 0)
+ elif isinstance(m, nn.LayerNorm):
+ nn.init.constant_(m.bias, 0)
+ nn.init.constant_(m.weight, 1.0)
+
+ def _encode_image(self, image, do_mask=False, return_all_blocks=False):
+ """
+ image has size B x 3 x img_size x img_size
+ do_mask: whether to perform masking or not
+ return_all_blocks: if True, return the features at the end of every block
+ instead of just the features from the last block (eg for some prediction heads)
+ """
+ # embed the image into patches (x has size B x Npatches x C)
+ # and get the position of each returned patch (pos has size B x Npatches x 2)
+ x, pos = self.patch_embed(image)
+ # add positional embedding without cls token
+ if self.enc_pos_embed is not None:
+ x = x + self.enc_pos_embed[None, ...]
+ # apply masking
+ B, N, C = x.size()
+ if do_mask:
+ masks = self.mask_generator(x)
+ x = x[~masks].view(B, -1, C)
+ posvis = pos[~masks].view(B, -1, 2)
+ else:
+ B, N, C = x.size()
+ masks = torch.zeros((B, N), dtype=bool)
+ posvis = pos
+ # now apply the transformer encoder and normalization
+ if return_all_blocks:
+ out = []
+ for blk in self.enc_blocks:
+ x = blk(x, posvis)
+ out.append(x)
+ out[-1] = self.enc_norm(out[-1])
+ return out, pos, masks
+ else:
+ for blk in self.enc_blocks:
+ x = blk(x, posvis)
+ x = self.enc_norm(x)
+ return x, pos, masks
+
+ def _decoder(self, feat1, pos1, masks1, feat2, pos2, return_all_blocks=False):
+ """
+ return_all_blocks: if True, return the features at the end of every block
+ instead of just the features from the last block (eg for some prediction heads)
+
+ masks1 can be None => assume image1 fully visible
+ """
+ # encoder to decoder layer
+ visf1 = self.decoder_embed(feat1)
+ f2 = self.decoder_embed(feat2)
+ # append masked tokens to the sequence
+ B, Nenc, C = visf1.size()
+ if masks1 is None: # downstream tasks
+ f1_ = visf1
+ else: # pretraining
+ Ntotal = masks1.size(1)
+ f1_ = self.mask_token.repeat(B, Ntotal, 1).to(dtype=visf1.dtype)
+ f1_[~masks1] = visf1.view(B * Nenc, C)
+ # add positional embedding
+ if self.dec_pos_embed is not None:
+ f1_ = f1_ + self.dec_pos_embed
+ f2 = f2 + self.dec_pos_embed
+ # apply Transformer blocks
+ out = f1_
+ out2 = f2
+ if return_all_blocks:
+ _out, out = out, []
+ for blk in self.dec_blocks:
+ _out, out2 = blk(_out, out2, pos1, pos2)
+ out.append(_out)
+ out[-1] = self.dec_norm(out[-1])
+ else:
+ for blk in self.dec_blocks:
+ out, out2 = blk(out, out2, pos1, pos2)
+ out = self.dec_norm(out)
+ return out
+
+ def patchify(self, imgs):
+ """
+ imgs: (B, 3, H, W)
+ x: (B, L, patch_size**2 *3)
+ """
+ p = self.patch_embed.patch_size[0]
+ assert imgs.shape[2] == imgs.shape[3] and imgs.shape[2] % p == 0
+
+ h = w = imgs.shape[2] // p
+ x = imgs.reshape(shape=(imgs.shape[0], 3, h, p, w, p))
+ x = torch.einsum("nchpwq->nhwpqc", x)
+ x = x.reshape(shape=(imgs.shape[0], h * w, p**2 * 3))
+
+ return x
+
+ def unpatchify(self, x, channels=3):
+ """
+ x: (N, L, patch_size**2 *channels)
+ imgs: (N, channels, H, W)
+ """
+ patch_size = self.patch_embed.patch_size[0]
+ h = w = int(x.shape[1] ** 0.5)
+ assert h * w == x.shape[1]
+ x = x.reshape(shape=(x.shape[0], h, w, patch_size, patch_size, channels))
+ x = torch.einsum("nhwpqc->nchpwq", x)
+ imgs = x.reshape(shape=(x.shape[0], channels, h * patch_size, h * patch_size))
+ return imgs
+
+ def forward(self, img1, img2):
+ """
+ img1: tensor of size B x 3 x img_size x img_size
+ img2: tensor of size B x 3 x img_size x img_size
+
+ out will be B x N x (3*patch_size*patch_size)
+ masks are also returned as B x N just in case
+ """
+ # encoder of the masked first image
+ feat1, pos1, mask1 = self._encode_image(img1, do_mask=True)
+ # encoder of the second image
+ feat2, pos2, _ = self._encode_image(img2, do_mask=False)
+ # decoder
+ decfeat = self._decoder(feat1, pos1, mask1, feat2, pos2)
+ # prediction head
+ out = self.prediction_head(decfeat)
+ # get target
+ target = self.patchify(img1)
+ return out, mask1, target
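
Reviewer note on `patchify` / `unpatchify` in `croco.py`: the einsum `nchpwq->nhwpqc` orders patches row-major over the patch grid, and inside each patch orders values by pixel row, pixel column, then channel. The sketch below spells out that index mapping for one square image stored as nested lists `[C][H][W]`. It is a pure-Python illustration, not the tensor code from the patch; the two function names simply mirror the methods.

```python
def patchify(img, p):
    """img: nested list [C][H][W], H == W divisible by p -> list of flat patches."""
    channels, size = len(img), len(img[0])
    assert all(len(row) == size for plane in img for row in plane) and size % p == 0
    n = size // p  # patches per side
    return [
        [img[c][hi * p + pi][wi * p + qi]
         for pi in range(p) for qi in range(p) for c in range(channels)]
        for hi in range(n) for wi in range(n)
    ]


def unpatchify(patches, p, channels=3):
    """Inverse of patchify: list of flat patches -> nested list [C][H][W]."""
    n = int(len(patches) ** 0.5)
    assert n * n == len(patches)
    size = n * p
    img = [[[0] * size for _ in range(size)] for _ in range(channels)]
    for idx, patch in enumerate(patches):
        hi, wi = divmod(idx, n)          # patch row / column in the grid
        for k, value in enumerate(patch):
            pq, c = divmod(k, channels)  # channel varies fastest
            pi, qi = divmod(pq, p)       # then column, then row inside the patch
            img[c][hi * p + pi][wi * p + qi] = value
    return img


if __name__ == "__main__":
    img = [[[c * 100 + y * 4 + x for x in range(4)] for y in range(4)] for c in range(3)]
    patches = patchify(img, 2)
    print(len(patches), len(patches[0]))
    print(unpatchify(patches, 2) == img)
```

This layout is what makes each row of the prediction head output, of width `patch_size**2 * 3`, directly comparable with the corresponding row of the `patchify(img1)` target.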
diff --git a/third_party/dust3r/croco/models/croco_downstream.py b/third_party/dust3r/croco/models/croco_downstream.py
new file mode 100644
index 0000000000000000000000000000000000000000..39d263a09668ce89888f6667bd5fe7a9d3739e58
--- /dev/null
+++ b/third_party/dust3r/croco/models/croco_downstream.py
@@ -0,0 +1,139 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+# --------------------------------------------------------
+# CroCo model for downstream tasks
+# --------------------------------------------------------
+
+import torch
+
+from .croco import CroCoNet
+
+
+def croco_args_from_ckpt(ckpt):
+ if "croco_kwargs" in ckpt: # CroCo v2 released models
+ return ckpt["croco_kwargs"]
+ elif "args" in ckpt and hasattr(
+ ckpt["args"], "model"
+ ): # pretrained using the official code release
+ s = ckpt[
+ "args"
+ ].model # eg "CroCoNet(enc_embed_dim=1024, enc_num_heads=16, enc_depth=24)"
+ assert s.startswith("CroCoNet(")
+ return eval(
+ "dict" + s[len("CroCoNet") :]
+ ) # transform it into the string of a dictionary and evaluate it
+ else: # CroCo v1 released models
+ return dict()
+
+
+class CroCoDownstreamMonocularEncoder(CroCoNet):
+ def __init__(self, head, **kwargs):
+ """Build network for monocular downstream task, only using the encoder.
+ It takes an extra argument head, that is called with the features
+ and a dictionary img_info containing 'width' and 'height' keys
+ The head is setup with the croconet arguments in this init function
+ NOTE: It works by calling super().__init__() but with redefined setters
+
+ """
+ super(CroCoDownstreamMonocularEncoder, self).__init__(**kwargs)
+ head.setup(self)
+ self.head = head
+
+ def _set_mask_generator(self, *args, **kwargs):
+ """No mask generator"""
+ return
+
+ def _set_mask_token(self, *args, **kwargs):
+ """No mask token"""
+ self.mask_token = None
+ return
+
+ def _set_decoder(self, *args, **kwargs):
+ """No decoder"""
+ return
+
+ def _set_prediction_head(self, *args, **kwargs):
+ """No 'prediction head' for downstream tasks."""
+ return
+
+ def forward(self, img):
+ """
+ img is of size batch_size x 3 x h x w
+ """
+ B, C, H, W = img.size()
+ img_info = {"height": H, "width": W}
+ need_all_layers = (
+ hasattr(self.head, "return_all_blocks") and self.head.return_all_blocks
+ )
+ out, _, _ = self._encode_image(
+ img, do_mask=False, return_all_blocks=need_all_layers
+ )
+ return self.head(out, img_info)
+
+
+class CroCoDownstreamBinocular(CroCoNet):
+ def __init__(self, head, **kwargs):
+ """Build network for binocular downstream task
+ It takes an extra argument head, that is called with the features
+ and a dictionary img_info containing 'width' and 'height' keys
+ The head is setup with the croconet arguments in this init function
+ """
+ super(CroCoDownstreamBinocular, self).__init__(**kwargs)
+ head.setup(self)
+ self.head = head
+
+ def _set_mask_generator(self, *args, **kwargs):
+ """No mask generator"""
+ return
+
+ def _set_mask_token(self, *args, **kwargs):
+ """No mask token"""
+ self.mask_token = None
+ return
+
+ def _set_prediction_head(self, *args, **kwargs):
+ """No prediction head for downstream tasks, define your own head"""
+ return
+
+ def encode_image_pairs(self, img1, img2, return_all_blocks=False):
+ """run encoder for a pair of images
+ it is actually ~5% faster to concatenate the images along the batch dimension
+ than to encode them separately
+ """
+ ## the two commented lines below are the naive version with separate encoding
+ # out, pos, _ = self._encode_image(img1, do_mask=False, return_all_blocks=return_all_blocks)
+ # out2, pos2, _ = self._encode_image(img2, do_mask=False, return_all_blocks=False)
+ ## and now the faster version
+ out, pos, _ = self._encode_image(
+ torch.cat((img1, img2), dim=0),
+ do_mask=False,
+ return_all_blocks=return_all_blocks,
+ )
+ if return_all_blocks:
+ out, out2 = list(map(list, zip(*[o.chunk(2, dim=0) for o in out])))
+ out2 = out2[-1]
+ else:
+ out, out2 = out.chunk(2, dim=0)
+ pos, pos2 = pos.chunk(2, dim=0)
+ return out, out2, pos, pos2
+
+ def forward(self, img1, img2):
+ B, C, H, W = img1.size()
+ img_info = {"height": H, "width": W}
+ return_all_blocks = (
+ hasattr(self.head, "return_all_blocks") and self.head.return_all_blocks
+ )
+ out, out2, pos, pos2 = self.encode_image_pairs(
+ img1, img2, return_all_blocks=return_all_blocks
+ )
+ if return_all_blocks:
+ decout = self._decoder(
+ out[-1], pos, None, out2, pos2, return_all_blocks=return_all_blocks
+ )
+ decout = out + decout
+ else:
+ decout = self._decoder(
+ out, pos, None, out2, pos2, return_all_blocks=return_all_blocks
+ )
+ return self.head(decout, img_info)
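
Reviewer note on `croco_args_from_ckpt` in `croco_downstream.py`: the function calls `eval` on a string stored in the checkpoint, so loading an untrusted checkpoint can execute arbitrary code. One possible alternative, sketched below, parses the string with the standard `ast` module and accepts only literal keyword values. The helper name `kwargs_from_model_string` is hypothetical, and this is not the upstream implementation. It is stricter than `eval`: a model string containing a non-literal value (for example a `partial(...)` call) would be rejected.

```python
import ast


def kwargs_from_model_string(s, prefix="CroCoNet"):
    """Parse 'CroCoNet(a=1, b=2)' into {'a': 1, 'b': 2} without calling eval."""
    call = ast.parse(s, mode="eval").body
    is_expected_call = (
        isinstance(call, ast.Call)
        and isinstance(call.func, ast.Name)
        and call.func.id == prefix
    )
    if not is_expected_call:
        raise ValueError(f"expected a {prefix}(...) call, got {s!r}")
    if call.args or any(kw.arg is None for kw in call.keywords):
        raise ValueError("only explicit keyword arguments are supported")
    # ast.literal_eval accepts numbers, strings, tuples, lists, dicts, booleans
    # and None; anything else (calls, names, attribute access) raises ValueError.
    return {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}


if __name__ == "__main__":
    s = "CroCoNet(enc_embed_dim=1024, enc_num_heads=16, enc_depth=24)"
    print(kwargs_from_model_string(s))
```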
diff --git a/third_party/dust3r/croco/models/curope/__init__.py b/third_party/dust3r/croco/models/curope/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..25e3d48a162760260826080f6366838e83e26878
--- /dev/null
+++ b/third_party/dust3r/croco/models/curope/__init__.py
@@ -0,0 +1,4 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+from .curope2d import cuRoPE2D
diff --git a/third_party/dust3r/croco/models/curope/curope.cpp b/third_party/dust3r/croco/models/curope/curope.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..8fe9058e05aa1bf3f37b0d970edc7312bc68455b
--- /dev/null
+++ b/third_party/dust3r/croco/models/curope/curope.cpp
@@ -0,0 +1,69 @@
+/*
+ Copyright (C) 2022-present Naver Corporation. All rights reserved.
+ Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+*/
+
+#include <torch/extension.h>
+
+// forward declaration
+void rope_2d_cuda( torch::Tensor tokens, const torch::Tensor pos, const float base, const float fwd );
+
+void rope_2d_cpu( torch::Tensor tokens, const torch::Tensor positions, const float base, const float fwd )
+{
+ const int B = tokens.size(0);
+ const int N = tokens.size(1);
+ const int H = tokens.size(2);
+ const int D = tokens.size(3) / 4;
+
+ auto tok = tokens.accessor<float, 4>();
+ auto pos = positions.accessor<int64_t, 3>();
+
+ for (int b = 0; b < B; b++) {
+ for (int x = 0; x < 2; x++) { // y and then x (2d)
+ for (int n = 0; n < N; n++) {
+
+ // grab the token position
+ const int p = pos[b][n][x];
+
+ for (int h = 0; h < H; h++) {
+ for (int d = 0; d < D; d++) {
+ // grab the two values
+ float u = tok[b][n][h][d+0+x*2*D];
+ float v = tok[b][n][h][d+D+x*2*D];
+
+ // grab the cos,sin
+ const float inv_freq = fwd * p / powf(base, d/float(D));
+ float c = cosf(inv_freq);
+ float s = sinf(inv_freq);
+
+ // write the result
+ tok[b][n][h][d+0+x*2*D] = u*c - v*s;
+ tok[b][n][h][d+D+x*2*D] = v*c + u*s;
+ }
+ }
+ }
+ }
+ }
+}
+
+void rope_2d( torch::Tensor tokens, // B,N,H,D
+ const torch::Tensor positions, // B,N,2
+ const float base,
+ const float fwd )
+{
+ TORCH_CHECK(tokens.dim() == 4, "tokens must have 4 dimensions");
+ TORCH_CHECK(positions.dim() == 3, "positions must have 3 dimensions");
+ TORCH_CHECK(tokens.size(0) == positions.size(0), "batch size differs between tokens & positions");
+ TORCH_CHECK(tokens.size(1) == positions.size(1), "seq_length differs between tokens & positions");
+ TORCH_CHECK(positions.size(2) == 2, "positions.shape[2] must be equal to 2");
+ TORCH_CHECK(tokens.is_cuda() == positions.is_cuda(), "tokens and positions are not on the same device" );
+
+ if (tokens.is_cuda())
+ rope_2d_cuda( tokens, positions, base, fwd );
+ else
+ rope_2d_cpu( tokens, positions, base, fwd );
+}
+
+PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
+ m.def("rope_2d", &rope_2d, "RoPE 2d forward/backward");
+}
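
Reviewer note on `rope_2d_cpu` in `curope.cpp`: each head vector of length `4 * D` is split into four blocks `[u_Y | v_Y | u_X | v_X]`, and every `(u, v)` pair is rotated by an angle proportional to the token's y or x position. The sketch below is a pure-Python reference for a single head vector, written to mirror the loop body above. The function name `rope_2d_token` is hypothetical, and the sketch returns a new list instead of updating in place.

```python
import math


def rope_2d_token(token, pos_yx, base=100.0, fwd=1.0):
    """Apply 2D RoPE to one head vector of length 4*D; pos_yx = (y, x)."""
    D = len(token) // 4
    out = list(token)
    for axis, p in enumerate(pos_yx):  # y block first, then x block
        off = axis * 2 * D
        for d in range(D):
            u, v = token[off + d], token[off + D + d]
            # Same expression as the C++ code: fwd * p / base^(d / D).
            angle = fwd * p / base ** (d / D)
            c, s = math.cos(angle), math.sin(angle)
            out[off + d] = u * c - v * s
            out[off + D + d] = v * c + u * s
    return out


if __name__ == "__main__":
    tok = [0.5, -1.0, 2.0, 0.25, 1.0, 3.0, -2.0, 0.75]
    rotated = rope_2d_token(tok, (3, 5))
    restored = rope_2d_token(rotated, (3, 5), fwd=-1.0)
    print(max(abs(a - b) for a, b in zip(tok, restored)))
```

Because the operation is a pure rotation, applying it again with `fwd` negated undoes it. This is consistent with `cuRoPE2D_func.backward` in `curope2d.py`, which reuses the same kernel with `-F0` to propagate gradients.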
diff --git a/third_party/dust3r/croco/models/curope/curope2d.py b/third_party/dust3r/croco/models/curope/curope2d.py
new file mode 100644
index 0000000000000000000000000000000000000000..b7272b8f03977ab41204afda489df5dd920dad79
--- /dev/null
+++ b/third_party/dust3r/croco/models/curope/curope2d.py
@@ -0,0 +1,39 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+import torch
+
+try:
+ import curope as _kernels # run `python setup.py install`
+except ModuleNotFoundError:
+ from . import curope as _kernels # run `python setup.py build_ext --inplace`
+
+
+class cuRoPE2D_func(torch.autograd.Function):
+ @staticmethod
+ def forward(ctx, tokens, positions, base, F0=1):
+ ctx.save_for_backward(positions)
+ ctx.saved_base = base
+ ctx.saved_F0 = F0
+ # tokens = tokens.clone() # uncomment this if inplace doesn't work
+ _kernels.rope_2d(tokens, positions, base, F0)
+ ctx.mark_dirty(tokens)
+ return tokens
+
+ @staticmethod
+ def backward(ctx, grad_res):
+ positions, base, F0 = ctx.saved_tensors[0], ctx.saved_base, ctx.saved_F0
+ _kernels.rope_2d(grad_res, positions, base, -F0)
+ ctx.mark_dirty(grad_res)
+ return grad_res, None, None, None
+
+
+class cuRoPE2D(torch.nn.Module):
+ def __init__(self, freq=100.0, F0=1.0):
+ super().__init__()
+ self.base = freq
+ self.F0 = F0
+
+ def forward(self, tokens, positions):
+ cuRoPE2D_func.apply(tokens.transpose(1, 2), positions, self.base, self.F0)
+ return tokens
diff --git a/third_party/dust3r/croco/models/curope/kernels.cu b/third_party/dust3r/croco/models/curope/kernels.cu
new file mode 100644
index 0000000000000000000000000000000000000000..7156cd1bb935cb1f0be45e58add53f9c21505c20
--- /dev/null
+++ b/third_party/dust3r/croco/models/curope/kernels.cu
@@ -0,0 +1,108 @@
+/*
+ Copyright (C) 2022-present Naver Corporation. All rights reserved.
+ Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+*/
+
+#include <torch/extension.h>
+#include <cuda.h>
+#include <cuda_runtime.h>
+#include <vector>
+
+#define CHECK_CUDA(tensor) {\
+ TORCH_CHECK((tensor).is_cuda(), #tensor " is not in cuda memory"); \
+ TORCH_CHECK((tensor).is_contiguous(), #tensor " is not contiguous"); }
+void CHECK_KERNEL() {auto error = cudaGetLastError(); TORCH_CHECK( error == cudaSuccess, cudaGetErrorString(error));}
+
+
+template < typename scalar_t >
+__global__ void rope_2d_cuda_kernel(
+ //scalar_t* __restrict__ tokens,
+ torch::PackedTensorAccessor32<scalar_t,4,torch::RestrictPtrTraits> tokens,
+ const int64_t* __restrict__ pos,
+ const float base,
+ const float fwd )
+ // const int N, const int H, const int D )
+{
+ // tokens shape = (B, N, H, D)
+ const int N = tokens.size(1);
+ const int H = tokens.size(2);
+ const int D = tokens.size(3);
+
+ // each block updates a single token, for all heads
+ // each thread takes care of a single output
+ extern __shared__ float shared[];
+ float* shared_inv_freq = shared + D;
+
+ const int b = blockIdx.x / N;
+ const int n = blockIdx.x % N;
+
+ const int Q = D / 4;
+ // one token = [0..Q : Q..2Q : 2Q..3Q : 3Q..D]
+ // u_Y v_Y u_X v_X
+
+ // shared memory: first, compute inv_freq
+ if (threadIdx.x < Q)
+ shared_inv_freq[threadIdx.x] = fwd / powf(base, threadIdx.x/float(Q));
+ __syncthreads();
+
+ // start of X or Y part
+ const int X = threadIdx.x < D/2 ? 0 : 1;
+ const int m = (X*D/2) + (threadIdx.x % Q); // index of u_Y or u_X
+
+ // grab the cos,sin appropriate for me
+ const float freq = pos[blockIdx.x*2+X] * shared_inv_freq[threadIdx.x % Q];
+ const float cos = cosf(freq);
+ const float sin = sinf(freq);
+ /*
+ float* shared_cos_sin = shared + D + D/4;
+ if ((threadIdx.x % (D/2)) < Q)
+ shared_cos_sin[m+0] = cosf(freq);
+ else
+ shared_cos_sin[m+Q] = sinf(freq);
+ __syncthreads();
+ const float cos = shared_cos_sin[m+0];
+ const float sin = shared_cos_sin[m+Q];
+ */
+
+ for (int h = 0; h < H; h++)
+ {
+ // then, load all the token for this head in shared memory
+ shared[threadIdx.x] = tokens[b][n][h][threadIdx.x];
+ __syncthreads();
+
+ const float u = shared[m];
+ const float v = shared[m+Q];
+
+ // write output
+ if ((threadIdx.x % (D/2)) < Q)
+ tokens[b][n][h][threadIdx.x] = u*cos - v*sin;
+ else
+ tokens[b][n][h][threadIdx.x] = v*cos + u*sin;
+ }
+}
+
+void rope_2d_cuda( torch::Tensor tokens, const torch::Tensor pos, const float base, const float fwd )
+{
+ const int B = tokens.size(0); // batch size
+ const int N = tokens.size(1); // sequence length
+ const int H = tokens.size(2); // number of heads
+ const int D = tokens.size(3); // dimension per head
+
+ TORCH_CHECK(tokens.stride(3) == 1 && tokens.stride(2) == D, "tokens are not contiguous");
+ TORCH_CHECK(pos.is_contiguous(), "positions are not contiguous");
+ TORCH_CHECK(pos.size(0) == B && pos.size(1) == N && pos.size(2) == 2, "bad pos.shape");
+ TORCH_CHECK(D % 4 == 0, "token dim must be multiple of 4");
+
+ // one block per token, one thread per channel of a head
+ const int THREADS_PER_BLOCK = D;
+ const int N_BLOCKS = B * N; // each block takes care of H*D values
+ const int SHARED_MEM = sizeof(float) * (D + D/4);
+
+ AT_DISPATCH_FLOATING_TYPES_AND_HALF(tokens.type(), "rope_2d_cuda", ([&] {
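+ rope_2d_cuda_kernel<scalar_t> <<<N_BLOCKS, THREADS_PER_BLOCK, SHARED_MEM>>> (
+ //tokens.data_ptr<scalar_t>(),
+ tokens.packed_accessor32<scalar_t,4,torch::RestrictPtrTraits>(),
+ pos.data_ptr<int64_t>(),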
+ base, fwd); //, N, H, D );
+ }));
+}
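Note on `kernels.cu`: the per-thread indexing in `rope_2d_cuda_kernel` is compact and easy to misread. Below is a minimal pure-Python sketch of the same arithmetic for a single head vector of one token. The name `rope_2d_reference` and the list-of-floats interface are assumptions made for illustration; they are not part of the extension.

```python
import math


def rope_2d_reference(token, pos_y, pos_x, base=100.0, fwd=1.0):
    """Rotate one head vector laid out as [u_Y | v_Y | u_X | v_X], each of size Q = D/4."""
    D = len(token)
    if D % 4 != 0:
        raise ValueError("token dim must be a multiple of 4")
    Q = D // 4
    # Same table the kernel keeps in shared memory: fwd / base^(i/Q)
    inv_freq = [fwd / base ** (i / Q) for i in range(Q)]
    out = [0.0] * D
    for t in range(D):  # t plays the role of threadIdx.x
        axis = 0 if t < D // 2 else 1          # 0 -> Y half, 1 -> X half
        m = axis * (D // 2) + (t % Q)          # index of u_Y or u_X
        angle = (pos_y, pos_x)[axis] * inv_freq[t % Q]
        c, s = math.cos(angle), math.sin(angle)
        u, v = token[m], token[m + Q]
        if (t % (D // 2)) < Q:
            out[t] = u * c - v * s             # this slot holds a "u" component
        else:
            out[t] = v * c + u * s             # this slot holds a "v" component
    return out
```

Calling the function again with `fwd=-1.0` undoes the rotation, which is why `cuRoPE2D_func.backward` can reuse the forward kernel with `-F0`.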
diff --git a/third_party/dust3r/croco/models/curope/setup.py b/third_party/dust3r/croco/models/curope/setup.py
new file mode 100644
index 0000000000000000000000000000000000000000..02ddb0912370a67a49fd2bb91164cf2f1da8648e
--- /dev/null
+++ b/third_party/dust3r/croco/models/curope/setup.py
@@ -0,0 +1,34 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+from setuptools import setup
+from torch import cuda
+from torch.utils.cpp_extension import BuildExtension, CUDAExtension
+
+# compile for all possible CUDA architectures
+all_cuda_archs = cuda.get_gencode_flags().replace("compute=", "arch=").split()
+# alternatively, you can list cuda archs that you want, eg:
+# all_cuda_archs = [
+# '-gencode', 'arch=compute_70,code=sm_70',
+# '-gencode', 'arch=compute_75,code=sm_75',
+# '-gencode', 'arch=compute_80,code=sm_80',
+# '-gencode', 'arch=compute_86,code=sm_86'
+# ]
+
+setup(
+ name="curope",
+ ext_modules=[
+ CUDAExtension(
+ name="curope",
+ sources=[
+ "curope.cpp",
+ "kernels.cu",
+ ],
+ extra_compile_args=dict(
+ nvcc=["-O3", "--ptxas-options=-v", "--use_fast_math"] + all_cuda_archs,
+ cxx=["-O3"],
+ ),
+ )
+ ],
+ cmdclass={"build_ext": BuildExtension},
+)
diff --git a/third_party/dust3r/croco/models/dpt_block.py b/third_party/dust3r/croco/models/dpt_block.py
new file mode 100644
index 0000000000000000000000000000000000000000..72541f0d716f7135d807d58d222d6b4b67472c5e
--- /dev/null
+++ b/third_party/dust3r/croco/models/dpt_block.py
@@ -0,0 +1,514 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+# --------------------------------------------------------
+# DPT head for ViTs
+# --------------------------------------------------------
+# References:
+# https://github.com/isl-org/DPT
+# https://github.com/EPFL-VILAB/MultiMAE/blob/main/multimae/output_adapters.py
+
+from typing import Dict, Iterable, List, Optional, Tuple, Union
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from einops import rearrange, repeat
+
+
+def pair(t):
+ return t if isinstance(t, tuple) else (t, t)
+
+
+def make_scratch(in_shape, out_shape, groups=1, expand=False):
+ scratch = nn.Module()
+
+ out_shape1 = out_shape
+ out_shape2 = out_shape
+ out_shape3 = out_shape
+ out_shape4 = out_shape
+ if expand == True:
+ out_shape1 = out_shape
+ out_shape2 = out_shape * 2
+ out_shape3 = out_shape * 4
+ out_shape4 = out_shape * 8
+
+ scratch.layer1_rn = nn.Conv2d(
+ in_shape[0],
+ out_shape1,
+ kernel_size=3,
+ stride=1,
+ padding=1,
+ bias=False,
+ groups=groups,
+ )
+ scratch.layer2_rn = nn.Conv2d(
+ in_shape[1],
+ out_shape2,
+ kernel_size=3,
+ stride=1,
+ padding=1,
+ bias=False,
+ groups=groups,
+ )
+ scratch.layer3_rn = nn.Conv2d(
+ in_shape[2],
+ out_shape3,
+ kernel_size=3,
+ stride=1,
+ padding=1,
+ bias=False,
+ groups=groups,
+ )
+ scratch.layer4_rn = nn.Conv2d(
+ in_shape[3],
+ out_shape4,
+ kernel_size=3,
+ stride=1,
+ padding=1,
+ bias=False,
+ groups=groups,
+ )
+
+ scratch.layer_rn = nn.ModuleList(
+ [
+ scratch.layer1_rn,
+ scratch.layer2_rn,
+ scratch.layer3_rn,
+ scratch.layer4_rn,
+ ]
+ )
+
+ return scratch
+
+
+class ResidualConvUnit_custom(nn.Module):
+ """Residual convolution module."""
+
+ def __init__(self, features, activation, bn):
+ """Init.
+ Args:
+ features (int): number of features
+ """
+ super().__init__()
+
+ self.bn = bn
+
+ self.groups = 1
+
+ self.conv1 = nn.Conv2d(
+ features,
+ features,
+ kernel_size=3,
+ stride=1,
+ padding=1,
+ bias=not self.bn,
+ groups=self.groups,
+ )
+
+ self.conv2 = nn.Conv2d(
+ features,
+ features,
+ kernel_size=3,
+ stride=1,
+ padding=1,
+ bias=not self.bn,
+ groups=self.groups,
+ )
+
+ if self.bn == True:
+ self.bn1 = nn.BatchNorm2d(features)
+ self.bn2 = nn.BatchNorm2d(features)
+
+ self.activation = activation
+
+ self.skip_add = nn.quantized.FloatFunctional()
+
+ def forward(self, x):
+ """Forward pass.
+ Args:
+ x (tensor): input
+ Returns:
+ tensor: output
+ """
+
+ out = self.activation(x)
+ out = self.conv1(out)
+ if self.bn == True:
+ out = self.bn1(out)
+
+ out = self.activation(out)
+ out = self.conv2(out)
+ if self.bn == True:
+ out = self.bn2(out)
+
+ if self.groups > 1:
+ out = self.conv_merge(out)
+
+ return self.skip_add.add(out, x)
+
+
+class FeatureFusionBlock_custom(nn.Module):
+ """Feature fusion block."""
+
+ def __init__(
+ self,
+ features,
+ activation,
+ deconv=False,
+ bn=False,
+ expand=False,
+ align_corners=True,
+ width_ratio=1,
+ ):
+ """Init.
+ Args:
+ features (int): number of features
+ """
+ super(FeatureFusionBlock_custom, self).__init__()
+ self.width_ratio = width_ratio
+
+ self.deconv = deconv
+ self.align_corners = align_corners
+
+ self.groups = 1
+
+ self.expand = expand
+ out_features = features
+ if self.expand == True:
+ out_features = features // 2
+
+ self.out_conv = nn.Conv2d(
+ features,
+ out_features,
+ kernel_size=1,
+ stride=1,
+ padding=0,
+ bias=True,
+ groups=1,
+ )
+
+ self.resConfUnit1 = ResidualConvUnit_custom(features, activation, bn)
+ self.resConfUnit2 = ResidualConvUnit_custom(features, activation, bn)
+
+ self.skip_add = nn.quantized.FloatFunctional()
+
+ def forward(self, *xs):
+ """Forward pass.
+ Returns:
+ tensor: output
+ """
+ output = xs[0]
+
+ if len(xs) == 2:
+ res = self.resConfUnit1(xs[1])
+ if self.width_ratio != 1:
+ res = F.interpolate(
+ res, size=(output.shape[2], output.shape[3]), mode="bilinear"
+ )
+
+ output = self.skip_add.add(output, res)
+ # output += res
+
+ output = self.resConfUnit2(output)
+
+ if self.width_ratio != 1:
+ # and output.shape[3] < self.width_ratio * output.shape[2]
+ # size=(image.shape[])
+ if (output.shape[3] / output.shape[2]) < (2 / 3) * self.width_ratio:
+ shape = 3 * output.shape[3]
+ else:
+ shape = int(self.width_ratio * 2 * output.shape[2])
+ output = F.interpolate(
+ output, size=(2 * output.shape[2], shape), mode="bilinear"
+ )
+ else:
+ output = nn.functional.interpolate(
+ output,
+ scale_factor=2,
+ mode="bilinear",
+ align_corners=self.align_corners,
+ )
+ output = self.out_conv(output)
+ return output
+
+
+def make_fusion_block(features, use_bn, width_ratio=1):
+ return FeatureFusionBlock_custom(
+ features,
+ nn.ReLU(False),
+ deconv=False,
+ bn=use_bn,
+ expand=False,
+ align_corners=True,
+ width_ratio=width_ratio,
+ )
+
+
+class Interpolate(nn.Module):
+ """Interpolation module."""
+
+ def __init__(self, scale_factor, mode, align_corners=False):
+ """Init.
+ Args:
+ scale_factor (float): scaling
+ mode (str): interpolation mode
+ """
+ super(Interpolate, self).__init__()
+
+ self.interp = nn.functional.interpolate
+ self.scale_factor = scale_factor
+ self.mode = mode
+ self.align_corners = align_corners
+
+ def forward(self, x):
+ """Forward pass.
+ Args:
+ x (tensor): input
+ Returns:
+ tensor: interpolated data
+ """
+
+ x = self.interp(
+ x,
+ scale_factor=self.scale_factor,
+ mode=self.mode,
+ align_corners=self.align_corners,
+ )
+
+ return x
+
+
+class DPTOutputAdapter(nn.Module):
+ """DPT output adapter.
+
+ :param num_channels: Number of output channels
+ :param stride_level: Stride level compared to the full-sized image.
+ E.g. 4 for 1/4th the size of the image.
+ :param patch_size: Int or tuple of the patch size over the full image size.
+ Patch size for smaller inputs will be computed accordingly.
+ :param hooks: Index of intermediate layers
+ :param layer_dims: Dimension of intermediate layers
+ :param feature_dim: Feature dimension
+ :param last_dim: out_channels/in_channels for the last two Conv2d when head_type == regression
+ :param use_bn: If set to True, activates batch norm
+ :param dim_tokens_enc: Dimension of tokens coming from encoder
+ """
+
+ def __init__(
+ self,
+ num_channels: int = 1,
+ stride_level: int = 1,
+ patch_size: Union[int, Tuple[int, int]] = 16,
+ main_tasks: Iterable[str] = ("rgb",),
+ hooks: List[int] = [2, 5, 8, 11],
+ layer_dims: List[int] = [96, 192, 384, 768],
+ feature_dim: int = 256,
+ last_dim: int = 32,
+ use_bn: bool = False,
+ dim_tokens_enc: Optional[int] = None,
+ head_type: str = "regression",
+ output_width_ratio=1,
+ **kwargs
+ ):
+ super().__init__()
+ self.num_channels = num_channels
+ self.stride_level = stride_level
+ self.patch_size = pair(patch_size)
+ self.main_tasks = main_tasks
+ self.hooks = hooks
+ self.layer_dims = layer_dims
+ self.feature_dim = feature_dim
+ self.dim_tokens_enc = (
+ dim_tokens_enc * len(self.main_tasks)
+ if dim_tokens_enc is not None
+ else None
+ )
+ self.head_type = head_type
+
+ # Actual patch height and width, taking into account stride of input
+ self.P_H = max(1, self.patch_size[0] // stride_level)
+ self.P_W = max(1, self.patch_size[1] // stride_level)
+
+ self.scratch = make_scratch(layer_dims, feature_dim, groups=1, expand=False)
+
+ self.scratch.refinenet1 = make_fusion_block(
+ feature_dim, use_bn, output_width_ratio
+ )
+ self.scratch.refinenet2 = make_fusion_block(
+ feature_dim, use_bn, output_width_ratio
+ )
+ self.scratch.refinenet3 = make_fusion_block(
+ feature_dim, use_bn, output_width_ratio
+ )
+ self.scratch.refinenet4 = make_fusion_block(
+ feature_dim, use_bn, output_width_ratio
+ )
+
+ if self.head_type == "regression":
+ # The "DPTDepthModel" head
+ self.head = nn.Sequential(
+ nn.Conv2d(
+ feature_dim, feature_dim // 2, kernel_size=3, stride=1, padding=1
+ ),
+ Interpolate(scale_factor=2, mode="bilinear", align_corners=True),
+ nn.Conv2d(
+ feature_dim // 2, last_dim, kernel_size=3, stride=1, padding=1
+ ),
+ nn.ReLU(True),
+ nn.Conv2d(
+ last_dim, self.num_channels, kernel_size=1, stride=1, padding=0
+ ),
+ )
+ elif self.head_type == "semseg":
+ # The "DPTSegmentationModel" head
+ self.head = nn.Sequential(
+ nn.Conv2d(
+ feature_dim, feature_dim, kernel_size=3, padding=1, bias=False
+ ),
+ nn.BatchNorm2d(feature_dim) if use_bn else nn.Identity(),
+ nn.ReLU(True),
+ nn.Dropout(0.1, False),
+ nn.Conv2d(feature_dim, self.num_channels, kernel_size=1),
+ Interpolate(scale_factor=2, mode="bilinear", align_corners=True),
+ )
+ else:
+ raise ValueError('DPT head_type must be "regression" or "semseg".')
+
+ if self.dim_tokens_enc is not None:
+ self.init(dim_tokens_enc=dim_tokens_enc)
+
+ def init(self, dim_tokens_enc=768):
+ """
+ Initialize parts of decoder that are dependent on dimension of encoder tokens.
+ Should be called when setting up MultiMAE.
+
+ :param dim_tokens_enc: Dimension of tokens coming from encoder
+ """
+ # print(dim_tokens_enc)
+
+ # Set up activation postprocessing layers
+ if isinstance(dim_tokens_enc, int):
+ dim_tokens_enc = 4 * [dim_tokens_enc]
+
+ self.dim_tokens_enc = [dt * len(self.main_tasks) for dt in dim_tokens_enc]
+
+ self.act_1_postprocess = nn.Sequential(
+ nn.Conv2d(
+ in_channels=self.dim_tokens_enc[0],
+ out_channels=self.layer_dims[0],
+ kernel_size=1,
+ stride=1,
+ padding=0,
+ ),
+ nn.ConvTranspose2d(
+ in_channels=self.layer_dims[0],
+ out_channels=self.layer_dims[0],
+ kernel_size=4,
+ stride=4,
+ padding=0,
+ bias=True,
+ dilation=1,
+ groups=1,
+ ),
+ )
+
+ self.act_2_postprocess = nn.Sequential(
+ nn.Conv2d(
+ in_channels=self.dim_tokens_enc[1],
+ out_channels=self.layer_dims[1],
+ kernel_size=1,
+ stride=1,
+ padding=0,
+ ),
+ nn.ConvTranspose2d(
+ in_channels=self.layer_dims[1],
+ out_channels=self.layer_dims[1],
+ kernel_size=2,
+ stride=2,
+ padding=0,
+ bias=True,
+ dilation=1,
+ groups=1,
+ ),
+ )
+
+ self.act_3_postprocess = nn.Sequential(
+ nn.Conv2d(
+ in_channels=self.dim_tokens_enc[2],
+ out_channels=self.layer_dims[2],
+ kernel_size=1,
+ stride=1,
+ padding=0,
+ )
+ )
+
+ self.act_4_postprocess = nn.Sequential(
+ nn.Conv2d(
+ in_channels=self.dim_tokens_enc[3],
+ out_channels=self.layer_dims[3],
+ kernel_size=1,
+ stride=1,
+ padding=0,
+ ),
+ nn.Conv2d(
+ in_channels=self.layer_dims[3],
+ out_channels=self.layer_dims[3],
+ kernel_size=3,
+ stride=2,
+ padding=1,
+ ),
+ )
+
+ self.act_postprocess = nn.ModuleList(
+ [
+ self.act_1_postprocess,
+ self.act_2_postprocess,
+ self.act_3_postprocess,
+ self.act_4_postprocess,
+ ]
+ )
+
+ def adapt_tokens(self, encoder_tokens):
+ # Adapt tokens
+ x = []
+ x.append(encoder_tokens[:, :])
+ x = torch.cat(x, dim=-1)
+ return x
+
+ def forward(self, encoder_tokens: List[torch.Tensor], image_size):
+ # input_info: Dict):
+ assert (
+ self.dim_tokens_enc is not None
+ ), "Need to call init(dim_tokens_enc) function first"
+ H, W = image_size
+
+ # Number of patches in height and width
+ N_H = H // (self.stride_level * self.P_H)
+ N_W = W // (self.stride_level * self.P_W)
+
+ # Hook decoder onto 4 layers from specified ViT layers
+ layers = [encoder_tokens[hook] for hook in self.hooks]
+
+ # Extract only task-relevant tokens and ignore global tokens.
+ layers = [self.adapt_tokens(l) for l in layers]
+
+ # Reshape tokens to spatial representation
+ layers = [
+ rearrange(l, "b (nh nw) c -> b c nh nw", nh=N_H, nw=N_W) for l in layers
+ ]
+
+ layers = [self.act_postprocess[idx](l) for idx, l in enumerate(layers)]
+ # Project layers to chosen feature dim
+ layers = [self.scratch.layer_rn[idx](l) for idx, l in enumerate(layers)]
+
+ # Fuse layers using refinement stages
+ path_4 = self.scratch.refinenet4(layers[3])
+ path_3 = self.scratch.refinenet3(path_4, layers[2])
+ path_2 = self.scratch.refinenet2(path_3, layers[1])
+ path_1 = self.scratch.refinenet1(path_2, layers[0])
+
+ # Output head
+ out = self.head(path_1)
+
+ return out
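Note on `dpt_block.py`: the spatial sizes flowing through `DPTOutputAdapter.forward` follow from the layer definitions above. The sketch below traces one spatial dimension with plain integer arithmetic, assuming `output_width_ratio == 1` and an input size divisible by the patch size. `dpt_feature_sizes` is a hypothetical helper written for this note, not a function from the repository.

```python
def dpt_feature_sizes(image_size, patch_size=16, stride_level=1):
    """Trace one spatial dimension through the DPT adapter (width_ratio == 1)."""
    p = max(1, patch_size // stride_level)       # P_H / P_W in the adapter
    n = image_size // (stride_level * p)         # N_H / N_W: patches per side
    sizes = {
        "layers[0]": 4 * n,                      # ConvTranspose2d, kernel 4, stride 4
        "layers[1]": 2 * n,                      # ConvTranspose2d, kernel 2, stride 2
        "layers[2]": n,                          # 1x1 conv only
        "layers[3]": (n + 2 * 1 - 3) // 2 + 1,   # 3x3 conv, stride 2, padding 1
    }
    size = sizes["layers[3]"]
    # Each fusion block upsamples by 2 before its output convolution.
    for name in ("path_4", "path_3", "path_2", "path_1"):
        size *= 2
        sizes[name] = size
    sizes["head"] = 2 * size                     # Interpolate(scale_factor=2) in the head
    return sizes
```

For a 224-pixel side with 16-pixel patches, `path_1` comes out at 112 and the head restores the full 224. With an odd number of patches per side, `path_4` no longer matches `layers[2]`, so the skip addition in `refinenet3` would fail under this assumption.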
diff --git a/third_party/dust3r/croco/models/head_downstream.py b/third_party/dust3r/croco/models/head_downstream.py
new file mode 100644
index 0000000000000000000000000000000000000000..27ac74095240822a69b01967386855946a3781c7
--- /dev/null
+++ b/third_party/dust3r/croco/models/head_downstream.py
@@ -0,0 +1,82 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+# --------------------------------------------------------
+# Heads for downstream tasks
+# --------------------------------------------------------
+
+"""
+A head is a module whose __init__ defines only the head hyperparameters.
+A method setup(croconet) takes a CroCoNet and sets all layers according to the head and croconet attributes.
+The forward takes the features as well as a dictionary img_info containing the keys 'width' and 'height'.
+"""
+
+import torch
+import torch.nn as nn
+
+from .dpt_block import DPTOutputAdapter
+
+
+class PixelwiseTaskWithDPT(nn.Module):
+ """DPT module for CroCo.
+ by default, hooks_idx will be equal to:
+ * for encoder-only: 4 equally spread layers
+ * for encoder+decoder: last encoder + 3 equally spread layers of the decoder
+ """
+
+ def __init__(
+ self,
+ *,
+ hooks_idx=None,
+ layer_dims=[96, 192, 384, 768],
+ output_width_ratio=1,
+ num_channels=1,
+ postprocess=None,
+ **kwargs,
+ ):
+ super(PixelwiseTaskWithDPT, self).__init__()
+ self.return_all_blocks = True # backbone needs to return all layers
+ self.postprocess = postprocess
+ self.output_width_ratio = output_width_ratio
+ self.num_channels = num_channels
+ self.hooks_idx = hooks_idx
+ self.layer_dims = layer_dims
+
+ def setup(self, croconet):
+ dpt_args = {
+ "output_width_ratio": self.output_width_ratio,
+ "num_channels": self.num_channels,
+ }
+ if self.hooks_idx is None:
+ if hasattr(croconet, "dec_blocks"): # encoder + decoder
+ step = {8: 3, 12: 4, 24: 8}[croconet.dec_depth]
+ hooks_idx = [
+ croconet.dec_depth + croconet.enc_depth - 1 - i * step
+ for i in range(3, -1, -1)
+ ]
+ else: # encoder only
+ step = croconet.enc_depth // 4
+ hooks_idx = [
+ croconet.enc_depth - 1 - i * step for i in range(3, -1, -1)
+ ]
+ self.hooks_idx = hooks_idx
+ print(
+ f" PixelwiseTaskWithDPT: automatically setting hook_idxs={self.hooks_idx}"
+ )
+ dpt_args["hooks"] = self.hooks_idx
+ dpt_args["layer_dims"] = self.layer_dims
+ self.dpt = DPTOutputAdapter(**dpt_args)
+ dim_tokens = [
+ croconet.enc_embed_dim
+ if hook < croconet.enc_depth
+ else croconet.dec_embed_dim
+ for hook in self.hooks_idx
+ ]
+ dpt_init_args = {"dim_tokens_enc": dim_tokens}
+ self.dpt.init(**dpt_init_args)
+
+ def forward(self, x, img_info):
+ out = self.dpt(x, image_size=(img_info["height"], img_info["width"]))
+ if self.postprocess:
+ out = self.postprocess(out)
+ return out
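Note on `head_downstream.py`: the default hook indices chosen in `PixelwiseTaskWithDPT.setup` can be checked by hand. The following sketch repeats the same index arithmetic outside the module; `default_hooks` is a hypothetical name, and the depths used in the usage note are example values only.

```python
def default_hooks(enc_depth, dec_depth=None):
    """Reproduce the hook index arithmetic of PixelwiseTaskWithDPT.setup."""
    if dec_depth is not None:                    # encoder + decoder
        step = {8: 3, 12: 4, 24: 8}[dec_depth]   # only these decoder depths are supported
        last = enc_depth + dec_depth - 1
    else:                                        # encoder only
        step = enc_depth // 4
        last = enc_depth - 1
    return [last - i * step for i in range(3, -1, -1)]
```

With a 12-block encoder alone this gives `[2, 5, 8, 11]`, the same values as the default `hooks` of `DPTOutputAdapter`. With a 24-block encoder and a 12-block decoder it gives `[23, 27, 31, 35]`, i.e. the last encoder block followed by three evenly spaced decoder blocks.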
diff --git a/third_party/dust3r/croco/models/masking.py b/third_party/dust3r/croco/models/masking.py
new file mode 100644
index 0000000000000000000000000000000000000000..ae18f927ae82e4075c2246ce722007c69a4da344
--- /dev/null
+++ b/third_party/dust3r/croco/models/masking.py
@@ -0,0 +1,26 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+
+# --------------------------------------------------------
+# Masking utils
+# --------------------------------------------------------
+
+import torch
+import torch.nn as nn
+
+
+class RandomMask(nn.Module):
+ """
+ random masking
+ """
+
+ def __init__(self, num_patches, mask_ratio):
+ super().__init__()
+ self.num_patches = num_patches
+ self.num_mask = int(mask_ratio * self.num_patches)
+
+ def __call__(self, x):
+ noise = torch.rand(x.size(0), self.num_patches, device=x.device)
+ argsort = torch.argsort(noise, dim=1)
+ return argsort < self.num_mask
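Note on `masking.py`: `RandomMask` relies on the fact that the argsort of random noise is a random permutation, so comparing it against `num_mask` marks exactly `num_mask` patches per sample. A standard-library sketch of that idea for a single sample follows; `random_mask` and its `rng` parameter are assumptions for illustration.

```python
import random


def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean list with exactly int(mask_ratio * num_patches) True entries."""
    num_mask = int(mask_ratio * num_patches)
    noise = [rng.random() for _ in range(num_patches)]
    # argsort of the noise: a random permutation of 0..num_patches-1
    argsort = sorted(range(num_patches), key=noise.__getitem__)
    # Each value below num_mask occurs exactly once in a permutation.
    return [value < num_mask for value in argsort]
```

For example, `random_mask(196, 0.9, random.Random(0))` masks 176 of the 196 patches, whatever the seed.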
diff --git a/third_party/dust3r/croco/models/pos_embed.py b/third_party/dust3r/croco/models/pos_embed.py
new file mode 100644
index 0000000000000000000000000000000000000000..a6d7e19babdd4e0b69156c32a2b7dafbd6f0cbe8
--- /dev/null
+++ b/third_party/dust3r/croco/models/pos_embed.py
@@ -0,0 +1,177 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+
+# --------------------------------------------------------
+# Position embedding utils
+# --------------------------------------------------------
+
+
+import numpy as np
+import torch
+
+
+# --------------------------------------------------------
+# 2D sine-cosine position embedding
+# References:
+# MAE: https://github.com/facebookresearch/mae/blob/main/util/pos_embed.py
+# Transformer: https://github.com/tensorflow/models/blob/master/official/nlp/transformer/model_utils.py
+# MoCo v3: https://github.com/facebookresearch/moco-v3
+# --------------------------------------------------------
+def get_2d_sincos_pos_embed(embed_dim, grid_size, n_cls_token=0):
+ """
+ grid_size: int of the grid height and width
+ return:
+ pos_embed: [grid_size*grid_size, embed_dim] or [n_cls_token+grid_size*grid_size, embed_dim] (w/o or w/ cls_token)
+ """
+ grid_h = np.arange(grid_size, dtype=np.float32)
+ grid_w = np.arange(grid_size, dtype=np.float32)
+ grid = np.meshgrid(grid_w, grid_h) # here w goes first
+ grid = np.stack(grid, axis=0)
+
+ grid = grid.reshape([2, 1, grid_size, grid_size])
+ pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid)
+ if n_cls_token > 0:
+ pos_embed = np.concatenate(
+ [np.zeros([n_cls_token, embed_dim]), pos_embed], axis=0
+ )
+ return pos_embed
+
+
+def get_2d_sincos_pos_embed_from_grid(embed_dim, grid):
+ assert embed_dim % 2 == 0
+
+ # use half of dimensions to encode grid_h
+ emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[0]) # (H*W, D/2)
+ emb_w = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[1]) # (H*W, D/2)
+
+ emb = np.concatenate([emb_h, emb_w], axis=1) # (H*W, D)
+ return emb
+
+
+def get_1d_sincos_pos_embed_from_grid(embed_dim, pos):
+ """
+ embed_dim: output dimension for each position
+ pos: a list of positions to be encoded: size (M,)
+ out: (M, D)
+ """
+ assert embed_dim % 2 == 0
+ omega = np.arange(embed_dim // 2, dtype=float)
+ omega /= embed_dim / 2.0
+ omega = 1.0 / 10000**omega # (D/2,)
+
+ pos = pos.reshape(-1) # (M,)
+ out = np.einsum("m,d->md", pos, omega) # (M, D/2), outer product
+
+ emb_sin = np.sin(out) # (M, D/2)
+ emb_cos = np.cos(out) # (M, D/2)
+
+ emb = np.concatenate([emb_sin, emb_cos], axis=1) # (M, D)
+ return emb
+
+
+# --------------------------------------------------------
+# Interpolate position embeddings for high-resolution
+# References:
+# MAE: https://github.com/facebookresearch/mae/blob/main/util/pos_embed.py
+# DeiT: https://github.com/facebookresearch/deit
+# --------------------------------------------------------
+def interpolate_pos_embed(model, checkpoint_model):
+ if "pos_embed" in checkpoint_model:
+ pos_embed_checkpoint = checkpoint_model["pos_embed"]
+ embedding_size = pos_embed_checkpoint.shape[-1]
+ num_patches = model.patch_embed.num_patches
+ num_extra_tokens = model.pos_embed.shape[-2] - num_patches
+ # height (== width) for the checkpoint position embedding
+ orig_size = int((pos_embed_checkpoint.shape[-2] - num_extra_tokens) ** 0.5)
+ # height (== width) for the new position embedding
+ new_size = int(num_patches**0.5)
+ # class_token and dist_token are kept unchanged
+ if orig_size != new_size:
+ print(
+ "Position interpolate from %dx%d to %dx%d"
+ % (orig_size, orig_size, new_size, new_size)
+ )
+ extra_tokens = pos_embed_checkpoint[:, :num_extra_tokens]
+ # only the position tokens are interpolated
+ pos_tokens = pos_embed_checkpoint[:, num_extra_tokens:]
+ pos_tokens = pos_tokens.reshape(
+ -1, orig_size, orig_size, embedding_size
+ ).permute(0, 3, 1, 2)
+ pos_tokens = torch.nn.functional.interpolate(
+ pos_tokens,
+ size=(new_size, new_size),
+ mode="bicubic",
+ align_corners=False,
+ )
+ pos_tokens = pos_tokens.permute(0, 2, 3, 1).flatten(1, 2)
+ new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1)
+ checkpoint_model["pos_embed"] = new_pos_embed
+
+
+# ----------------------------------------------------------
+# RoPE2D: RoPE implementation in 2D
+# ----------------------------------------------------------
+
+try:
+ from models.curope import cuRoPE2D
+
+ RoPE2D = cuRoPE2D
+except ImportError:
+ print(
+ "Warning, cannot find cuda-compiled version of RoPE2D, using a slow pytorch version instead"
+ )
+
+ class RoPE2D(torch.nn.Module):
+ def __init__(self, freq=100.0, F0=1.0):
+ super().__init__()
+ self.base = freq
+ self.F0 = F0
+ self.cache = {}
+
+ def get_cos_sin(self, D, seq_len, device, dtype):
+ if (D, seq_len, device, dtype) not in self.cache:
+ inv_freq = 1.0 / (
+ self.base ** (torch.arange(0, D, 2).float().to(device) / D)
+ )
+ t = torch.arange(seq_len, device=device, dtype=inv_freq.dtype)
+ freqs = torch.einsum("i,j->ij", t, inv_freq).to(dtype)
+ freqs = torch.cat((freqs, freqs), dim=-1)
+ cos = freqs.cos() # (Seq, Dim)
+ sin = freqs.sin()
+ self.cache[D, seq_len, device, dtype] = (cos, sin)
+ return self.cache[D, seq_len, device, dtype]
+
+ @staticmethod
+ def rotate_half(x):
+ x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
+ return torch.cat((-x2, x1), dim=-1)
+
+ def apply_rope1d(self, tokens, pos1d, cos, sin):
+ assert pos1d.ndim == 2
+ cos = torch.nn.functional.embedding(pos1d, cos)[:, None, :, :]
+ sin = torch.nn.functional.embedding(pos1d, sin)[:, None, :, :]
+ return (tokens * cos) + (self.rotate_half(tokens) * sin)
+
+ def forward(self, tokens, positions):
+ """
+ input:
+ * tokens: batch_size x nheads x ntokens x dim
+ * positions: batch_size x ntokens x 2 (y and x position of each token)
+ output:
+ * tokens after applying RoPE2D (batch_size x nheads x ntokens x dim)
+ """
+ assert (
+ tokens.size(3) % 2 == 0
+ ), "number of dimensions should be a multiple of two"
+ D = tokens.size(3) // 2
+ assert positions.ndim == 3 and positions.shape[-1] == 2 # Batch, Seq, 2
+ cos, sin = self.get_cos_sin(
+ D, int(positions.max()) + 1, tokens.device, tokens.dtype
+ )
+ # split features into two along the feature dimension, and apply rope1d on each half
+ y, x = tokens.chunk(2, dim=-1)
+ y = self.apply_rope1d(y, positions[:, :, 0], cos, sin)
+ x = self.apply_rope1d(x, positions[:, :, 1], cos, sin)
+ tokens = torch.cat((y, x), dim=-1)
+ return tokens
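Note on `pos_embed.py`: the layout produced by `get_1d_sincos_pos_embed_from_grid` (all sines first, then all cosines, with frequencies `1 / 10000**(i / (D/2))`) is easy to verify on a tiny case. The sketch below mirrors that layout with the standard library only; `sincos_1d` is a hypothetical stand-in, not the NumPy implementation in this file.

```python
import math


def sincos_1d(embed_dim, positions):
    """Return one [sin..., cos...] row of length embed_dim per position."""
    if embed_dim % 2 != 0:
        raise ValueError("embed_dim must be even")
    half = embed_dim // 2
    omega = [1.0 / 10000 ** (i / half) for i in range(half)]  # (D/2,) frequencies
    return [
        [math.sin(p * w) for w in omega] + [math.cos(p * w) for w in omega]
        for p in positions
    ]
```

Position 0 always maps to zeros in the sine half and ones in the cosine half, which gives a quick sanity check when comparing against the NumPy version.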
diff --git a/third_party/dust3r/croco/pretrain.py b/third_party/dust3r/croco/pretrain.py
new file mode 100644
index 0000000000000000000000000000000000000000..111f72148bdbdc1f00be89c169745e2df1792c94
--- /dev/null
+++ b/third_party/dust3r/croco/pretrain.py
@@ -0,0 +1,389 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+#
+# --------------------------------------------------------
+# Pre-training CroCo
+# --------------------------------------------------------
+# References:
+# MAE: https://github.com/facebookresearch/mae
+# DeiT: https://github.com/facebookresearch/deit
+# BEiT: https://github.com/microsoft/unilm/tree/master/beit
+# --------------------------------------------------------
+import argparse
+import datetime
+import json
+import math
+import os
+import sys
+import time
+from pathlib import Path
+from typing import Iterable
+
+import numpy as np
+import torch
+import torch.backends.cudnn as cudnn
+import torch.distributed as dist
+import torchvision.datasets as datasets
+import torchvision.transforms as transforms
+import utils.misc as misc
+from datasets.pairs_dataset import PairsDataset
+from models.criterion import MaskedMSE
+from models.croco import CroCoNet
+from torch.utils.tensorboard import SummaryWriter
+from utils.misc import NativeScalerWithGradNormCount as NativeScaler
+
+
+def get_args_parser():
+ parser = argparse.ArgumentParser("CroCo pre-training", add_help=False)
+ # model and criterion
+ parser.add_argument(
+ "--model",
+ default="CroCoNet()",
+ type=str,
+ help="string containing the model to build",
+ )
+ parser.add_argument(
+ "--norm_pix_loss",
+ default=1,
+ type=int, choices=[0, 1],
+ help="apply per-patch mean/std normalization before applying the loss",
+ )
+ # dataset
+ parser.add_argument(
+ "--dataset", default="habitat_release", type=str, help="training set"
+ )
+ parser.add_argument(
+ "--transforms", default="crop224+acolor", type=str, help="transforms to apply"
+ ) # in the paper, we also used some homography and rotation, but later found that they were not useful or even harmful
+ # training
+ parser.add_argument("--seed", default=0, type=int, help="Random seed")
+ parser.add_argument(
+ "--batch_size",
+ default=64,
+ type=int,
+ help="Batch size per GPU (effective batch size is batch_size * accum_iter * # gpus)",
+ )
+ parser.add_argument(
+ "--epochs",
+ default=800,
+ type=int,
+ help="Maximum number of epochs for the scheduler",
+ )
+ parser.add_argument(
+ "--max_epoch", default=400, type=int, help="Stop training at this epoch"
+ )
+ parser.add_argument(
+ "--accum_iter",
+ default=1,
+ type=int,
+ help="Accumulate gradient iterations (for increasing the effective batch size under memory constraints)",
+ )
+ parser.add_argument(
+ "--weight_decay", type=float, default=0.05, help="weight decay (default: 0.05)"
+ )
+ parser.add_argument(
+ "--lr",
+ type=float,
+ default=None,
+ metavar="LR",
+ help="learning rate (absolute lr)",
+ )
+ parser.add_argument(
+ "--blr",
+ type=float,
+ default=1.5e-4,
+ metavar="LR",
+ help="base learning rate: absolute_lr = base_lr * total_batch_size / 256",
+ )
+ parser.add_argument(
+ "--min_lr",
+ type=float,
+ default=0.0,
+ metavar="LR",
+ help="lower lr bound for cyclic schedulers that hit 0",
+ )
+ parser.add_argument(
+ "--warmup_epochs", type=int, default=40, metavar="N", help="epochs to warmup LR"
+ )
+ parser.add_argument(
+ "--amp",
+ type=int,
+ default=1,
+ choices=[0, 1],
+ help="Use Automatic Mixed Precision for pretraining",
+ )
+ # others
+ parser.add_argument("--num_workers", default=8, type=int)
+ parser.add_argument(
+ "--world_size", default=1, type=int, help="number of distributed processes"
+ )
+ parser.add_argument("--local_rank", default=-1, type=int)
+ parser.add_argument(
+ "--dist_url", default="env://", help="url used to set up distributed training"
+ )
+ parser.add_argument(
+ "--save_freq",
+ default=1,
+ type=int,
+ help="frequency (number of epochs) to save checkpoint in checkpoint-last.pth",
+ )
+ parser.add_argument(
+ "--keep_freq",
+ default=20,
+ type=int,
+ help="frequency (number of epochs) to save checkpoint in checkpoint-%d.pth",
+ )
+ parser.add_argument(
+ "--print_freq",
+ default=20,
+ type=int,
+ help="frequency (number of iterations) to print infos while training",
+ )
+ # paths
+ parser.add_argument(
+ "--output_dir",
+ default="./output/",
+ type=str,
+ help="path where to save the output",
+ )
+ parser.add_argument(
+ "--data_dir", default="./data/", type=str, help="path where data are stored"
+ )
+ return parser
+
+
+def main(args):
+ misc.init_distributed_mode(args)
+ global_rank = misc.get_rank()
+ world_size = misc.get_world_size()
+
+ print("output_dir: " + args.output_dir)
+ if args.output_dir:
+ Path(args.output_dir).mkdir(parents=True, exist_ok=True)
+
+ # auto resume
+ last_ckpt_fname = os.path.join(args.output_dir, "checkpoint-last.pth")
+ args.resume = last_ckpt_fname if os.path.isfile(last_ckpt_fname) else None
+
+ print("job dir: {}".format(os.path.dirname(os.path.realpath(__file__))))
+ print("{}".format(args).replace(", ", ",\n"))
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ device = torch.device(device)
+
+ # fix the seed
+ seed = args.seed + misc.get_rank()
+ torch.manual_seed(seed)
+ np.random.seed(seed)
+
+ cudnn.benchmark = True
+
+ ## training dataset and loader
+ print(
+ "Building dataset for {:s} with transforms {:s}".format(
+ args.dataset, args.transforms
+ )
+ )
+ dataset = PairsDataset(args.dataset, trfs=args.transforms, data_dir=args.data_dir)
+ if world_size > 1:
+ sampler_train = torch.utils.data.DistributedSampler(
+ dataset, num_replicas=world_size, rank=global_rank, shuffle=True
+ )
+ print("Sampler_train = %s" % str(sampler_train))
+ else:
+ sampler_train = torch.utils.data.RandomSampler(dataset)
+ data_loader_train = torch.utils.data.DataLoader(
+ dataset,
+ sampler=sampler_train,
+ batch_size=args.batch_size,
+ num_workers=args.num_workers,
+ pin_memory=True,
+ drop_last=True,
+ )
+
+ ## model
+ print("Loading model: {:s}".format(args.model))
+ model = eval(args.model)
+ print(
+ "Loading criterion: MaskedMSE(norm_pix_loss={:s})".format(
+ str(bool(args.norm_pix_loss))
+ )
+ )
+ criterion = MaskedMSE(norm_pix_loss=bool(args.norm_pix_loss))
+
+ model.to(device)
+ model_without_ddp = model
+ print("Model = %s" % str(model_without_ddp))
+
+ eff_batch_size = args.batch_size * args.accum_iter * misc.get_world_size()
+ if args.lr is None: # only base_lr is specified
+ args.lr = args.blr * eff_batch_size / 256
+ print("base lr: %.2e" % (args.lr * 256 / eff_batch_size))
+ print("actual lr: %.2e" % args.lr)
+ print("accumulate grad iterations: %d" % args.accum_iter)
+ print("effective batch size: %d" % eff_batch_size)
+
+ if args.distributed:
+ model = torch.nn.parallel.DistributedDataParallel(
+ model, device_ids=[args.gpu], find_unused_parameters=True, static_graph=True
+ )
+ model_without_ddp = model.module
+
+ param_groups = misc.get_parameter_groups(
+ model_without_ddp, args.weight_decay
+ ) # following timm: set wd as 0 for bias and norm layers
+ optimizer = torch.optim.AdamW(param_groups, lr=args.lr, betas=(0.9, 0.95))
+ print(optimizer)
+ loss_scaler = NativeScaler()
+
+ misc.load_model(
+ args=args,
+ model_without_ddp=model_without_ddp,
+ optimizer=optimizer,
+ loss_scaler=loss_scaler,
+ )
+
+ if global_rank == 0 and args.output_dir is not None:
+ log_writer = SummaryWriter(log_dir=args.output_dir)
+ else:
+ log_writer = None
+
+ print(f"Start training until {args.max_epoch} epochs")
+ start_time = time.time()
+ for epoch in range(args.start_epoch, args.max_epoch):
+ if world_size > 1:
+ data_loader_train.sampler.set_epoch(epoch)
+
+ train_stats = train_one_epoch(
+ model,
+ criterion,
+ data_loader_train,
+ optimizer,
+ device,
+ epoch,
+ loss_scaler,
+ log_writer=log_writer,
+ args=args,
+ )
+
+ if args.output_dir and epoch % args.save_freq == 0:
+ misc.save_model(
+ args=args,
+ model_without_ddp=model_without_ddp,
+ optimizer=optimizer,
+ loss_scaler=loss_scaler,
+ epoch=epoch,
+ fname="last",
+ )
+
+ if (
+ args.output_dir
+ and (epoch % args.keep_freq == 0 or epoch + 1 == args.max_epoch)
+ and (epoch > 0 or args.max_epoch == 1)
+ ):
+ misc.save_model(
+ args=args,
+ model_without_ddp=model_without_ddp,
+ optimizer=optimizer,
+ loss_scaler=loss_scaler,
+ epoch=epoch,
+ )
+
+ log_stats = {
+ **{f"train_{k}": v for k, v in train_stats.items()},
+ "epoch": epoch,
+ }
+
+ if args.output_dir and misc.is_main_process():
+ if log_writer is not None:
+ log_writer.flush()
+ with open(
+ os.path.join(args.output_dir, "log.txt"), mode="a", encoding="utf-8"
+ ) as f:
+ f.write(json.dumps(log_stats) + "\n")
+
+ total_time = time.time() - start_time
+ total_time_str = str(datetime.timedelta(seconds=int(total_time)))
+ print("Training time {}".format(total_time_str))
+
+
+def train_one_epoch(
+ model: torch.nn.Module,
+ criterion: torch.nn.Module,
+ data_loader: Iterable,
+ optimizer: torch.optim.Optimizer,
+ device: torch.device,
+ epoch: int,
+ loss_scaler,
+ log_writer=None,
+ args=None,
+):
+ model.train(True)
+ metric_logger = misc.MetricLogger(delimiter=" ")
+ metric_logger.add_meter("lr", misc.SmoothedValue(window_size=1, fmt="{value:.6f}"))
+ header = "Epoch: [{}]".format(epoch)
+ accum_iter = args.accum_iter
+
+ optimizer.zero_grad()
+
+ if log_writer is not None:
+ print("log_dir: {}".format(log_writer.log_dir))
+
+ for data_iter_step, (image1, image2) in enumerate(
+ metric_logger.log_every(data_loader, args.print_freq, header)
+ ):
+ # we use a per iteration lr scheduler
+ if data_iter_step % accum_iter == 0:
+ misc.adjust_learning_rate(
+ optimizer, data_iter_step / len(data_loader) + epoch, args
+ )
+
+ image1 = image1.to(device, non_blocking=True)
+ image2 = image2.to(device, non_blocking=True)
+ with torch.cuda.amp.autocast(enabled=bool(args.amp)):
+ out, mask, target = model(image1, image2)
+ loss = criterion(out, mask, target)
+
+ loss_value = loss.item()
+
+ if not math.isfinite(loss_value):
+ print("Loss is {}, stopping training".format(loss_value))
+ sys.exit(1)
+
+ loss /= accum_iter
+ loss_scaler(
+ loss,
+ optimizer,
+ parameters=model.parameters(),
+ update_grad=(data_iter_step + 1) % accum_iter == 0,
+ )
+ if (data_iter_step + 1) % accum_iter == 0:
+ optimizer.zero_grad()
+
+ torch.cuda.synchronize()
+
+ metric_logger.update(loss=loss_value)
+
+ lr = optimizer.param_groups[0]["lr"]
+ metric_logger.update(lr=lr)
+
+ loss_value_reduce = misc.all_reduce_mean(loss_value)
+ if (
+ log_writer is not None
+ and ((data_iter_step + 1) % (accum_iter * args.print_freq)) == 0
+ ):
+ # the tensorboard x-axis uses epoch_1000x so that curves stay comparable when the batch size changes
+ epoch_1000x = int((data_iter_step / len(data_loader) + epoch) * 1000)
+ log_writer.add_scalar("train_loss", loss_value_reduce, epoch_1000x)
+ log_writer.add_scalar("lr", lr, epoch_1000x)
+
+ # gather the stats from all processes
+ metric_logger.synchronize_between_processes()
+ print("Averaged stats:", metric_logger)
+ return {k: meter.global_avg for k, meter in metric_logger.meters.items()}
+
+
+if __name__ == "__main__":
+ args = get_args_parser()
+ args = args.parse_args()
+ main(args)
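
Note on the learning-rate options in the file above: the `--blr` help text gives `absolute_lr = base_lr * total_batch_size / 256`, and `main` computes the total batch size as `batch_size * accum_iter * world_size`. The arithmetic can be sketched as follows. The helper names and the example values (64 images per GPU, 4 processes) are assumptions made for illustration and are not part of the patch.

```python
def effective_batch_size(batch_size: int, accum_iter: int, world_size: int) -> int:
    """Samples contributing to one optimizer step across all processes."""
    return batch_size * accum_iter * world_size


def absolute_lr(blr: float, eff_batch_size: int, reference: int = 256) -> float:
    """Linear scaling rule: the base lr is defined for a batch of `reference` samples."""
    return blr * eff_batch_size / reference


if __name__ == "__main__":
    # Hypothetical setup: 64 images per GPU, no accumulation, 4 processes.
    eff = effective_batch_size(batch_size=64, accum_iter=1, world_size=4)
    print("effective batch size:", eff)
    print("absolute lr: %.2e" % absolute_lr(1.5e-4, eff))
```

With these example values the effective batch size equals the reference of 256, so the absolute learning rate equals the default `--blr`.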
diff --git a/third_party/dust3r/croco/stereoflow/README.MD b/third_party/dust3r/croco/stereoflow/README.MD
new file mode 100644
index 0000000000000000000000000000000000000000..81595380fadd274b523e0cf77921b1b65cbedb34
--- /dev/null
+++ b/third_party/dust3r/croco/stereoflow/README.MD
@@ -0,0 +1,318 @@
+## CroCo-Stereo and CroCo-Flow
+
+This README explains how to use CroCo-Stereo and CroCo-Flow as well as how they were trained.
+All commands should be launched from the root directory.
+
+### Simple inference example
+
+We provide a simple inference example for CroCo-Stereo and CroCo-Flow in the notebook `croco-stereo-flow-demo.ipynb`.
+Before running it, please download the trained models with:
+```
+bash stereoflow/download_model.sh crocostereo.pth
+bash stereoflow/download_model.sh crocoflow.pth
+```
+
+### Prepare data for training or evaluation
+
+Put the datasets used for training/evaluation in `./data/stereoflow` (or update the paths at the top of `stereoflow/datasets_stereo.py` and `stereoflow/datasets_flow.py`).
+The expected file structure for each dataset is shown below:
+
+FlyingChairs
+
+```
+./data/stereoflow/FlyingChairs/
+└───chairs_split.txt
+└───data/
+    └─── ...
+```
+
+
+
+MPI-Sintel
+
+```
+./data/stereoflow/MPI-Sintel/
+└───training/
+│   └───clean/
+│   └───final/
+│   └───flow/
+└───test/
+    └───clean/
+    └───final/
+```
+
+
+
+SceneFlow (including FlyingThings)
+
+```
+./data/stereoflow/SceneFlow/
+└───Driving/
+│   └───disparity/
+│   └───frames_cleanpass/
+│   └───frames_finalpass/
+└───FlyingThings/
+│   └───disparity/
+│   └───frames_cleanpass/
+│   └───frames_finalpass/
+│   └───optical_flow/
+└───Monkaa/
+    └───disparity/
+    └───frames_cleanpass/
+    └───frames_finalpass/
+```
+
+
+
+TartanAir
+
+```
+./data/stereoflow/TartanAir/
+└───abandonedfactory/
+│   └───.../
+└───abandonedfactory_night/
+│   └───.../
+└───.../
+```
+
+
+
+Booster
+
+```
+./data/stereoflow/booster_gt/
+└───train/
+    └───balanced/
+        └───Bathroom/
+        └───Bedroom/
+        └───...
+```
+
+
+
+CREStereo
+
+```
+./data/stereoflow/crenet_stereo_trainset/
+└───stereo_trainset/
+    └───crestereo/
+        └───hole/
+        └───reflective/
+        └───shapenet/
+        └───tree/
+```
+
+
+
+ETH3D Two-view Low-res
+
+```
+./data/stereoflow/eth3d_lowres/
+└───test/
+│   └───lakeside_1l/
+│   └───...
+└───train/
+│   └───delivery_area_1l/
+│   └───...
+└───train_gt/
+    └───delivery_area_1l/
+    └───...
+```
+
+
+
+KITTI 2012
+
+```
+./data/stereoflow/kitti-stereo-2012/
+└───testing/
+│   └───colored_0/
+│   └───colored_1/
+└───training/
+    └───colored_0/
+    └───colored_1/
+    └───disp_occ/
+    └───flow_occ/
+```
+
+
+
+KITTI 2015
+
+```
+./data/stereoflow/kitti-stereo-2015/
+└───testing/
+│   └───image_2/
+│   └───image_3/
+└───training/
+    └───image_2/
+    └───image_3/
+    └───disp_occ_0/
+    └───flow_occ/
+```
+
+
+
+Middlebury
+
+```
+./data/stereoflow/middlebury
+└───2005/
+│   └───train/
+│       └───Art/
+│       └───...
+└───2006/
+│   └───Aloe/
+│   └───Baby1/
+│   └───...
+└───2014/
+│   └───Adirondack-imperfect/
+│   └───Adirondack-perfect/
+│   └───...
+└───2021/
+│   └───data/
+│       └───artroom1/
+│       └───artroom2/
+│       └───...
+└───MiddEval3_F/
+    └───test/
+    │   └───Australia/
+    │   └───...
+    └───train/
+        └───Adirondack/
+        └───...
+```
+
+
+
+Spring
+
+```
+./data/stereoflow/spring/
+└───test/
+│   └───0003/
+│   └───...
+└───train/
+    └───0001/
+    └───...
+```
+
+
+
+### CroCo-Stereo
+
+##### Main model
+
+The main training of CroCo-Stereo was performed on a series of datasets, and the resulting model was used as is for the Middlebury v3 benchmark.
+
+```
+# Download the model
+bash stereoflow/download_model.sh crocostereo.pth
+# Middlebury v3 submission
+python stereoflow/test.py --model stereoflow_models/crocostereo.pth --dataset "MdEval3('all_full')" --save submission --tile_overlap 0.9
+# Training command that was used, using checkpoint-last.pth
+python -u stereoflow/train.py stereo --criterion "LaplacianLossBounded2()" --dataset "CREStereo('train')+SceneFlow('train_allpass')+30*ETH3DLowRes('train')+50*Md05('train')+50*Md06('train')+50*Md14('train')+50*Md21('train')+50*MdEval3('train_full')+Booster('train_balanced')" --val_dataset "SceneFlow('test1of100_finalpass')+SceneFlow('test1of100_cleanpass')+ETH3DLowRes('subval')+Md05('subval')+Md06('subval')+Md14('subval')+Md21('subval')+MdEval3('subval_full')+Booster('subval_balanced')" --lr 3e-5 --batch_size 6 --epochs 32 --pretrained pretrained_models/CroCo_V2_ViTLarge_BaseDecoder.pth --output_dir xps/crocostereo/main/
+# or it can be launched on multiple gpus (while maintaining the effective batch size), e.g. on 3 gpus:
+torchrun --nproc_per_node 3 stereoflow/train.py stereo --criterion "LaplacianLossBounded2()" --dataset "CREStereo('train')+SceneFlow('train_allpass')+30*ETH3DLowRes('train')+50*Md05('train')+50*Md06('train')+50*Md14('train')+50*Md21('train')+50*MdEval3('train_full')+Booster('train_balanced')" --val_dataset "SceneFlow('test1of100_finalpass')+SceneFlow('test1of100_cleanpass')+ETH3DLowRes('subval')+Md05('subval')+Md06('subval')+Md14('subval')+Md21('subval')+MdEval3('subval_full')+Booster('subval_balanced')" --lr 3e-5 --batch_size 2 --epochs 32 --pretrained pretrained_models/CroCo_V2_ViTLarge_BaseDecoder.pth --output_dir xps/crocostereo/main/
+```
+
+For evaluation on the validation sets, we also provide the model trained on the `subtrain` subset of the training sets.
+
+```
+# Download the model
+bash stereoflow/download_model.sh crocostereo_subtrain.pth
+# Evaluation on validation sets
+python stereoflow/test.py --model stereoflow_models/crocostereo_subtrain.pth --dataset "MdEval3('subval_full')+ETH3DLowRes('subval')+SceneFlow('test_finalpass')+SceneFlow('test_cleanpass')" --save metrics --tile_overlap 0.9
+# Training command that was used (same as above but on subtrain, using checkpoint-best.pth), can also be launched on multiple gpus
+python -u stereoflow/train.py stereo --criterion "LaplacianLossBounded2()" --dataset "CREStereo('train')+SceneFlow('train_allpass')+30*ETH3DLowRes('subtrain')+50*Md05('subtrain')+50*Md06('subtrain')+50*Md14('subtrain')+50*Md21('subtrain')+50*MdEval3('subtrain_full')+Booster('subtrain_balanced')" --val_dataset "SceneFlow('test1of100_finalpass')+SceneFlow('test1of100_cleanpass')+ETH3DLowRes('subval')+Md05('subval')+Md06('subval')+Md14('subval')+Md21('subval')+MdEval3('subval_full')+Booster('subval_balanced')" --lr 3e-5 --batch_size 6 --epochs 32 --pretrained pretrained_models/CroCo_V2_ViTLarge_BaseDecoder.pth --output_dir xps/crocostereo/main_subtrain/
+```
+
+##### Other models
+
+
+ Model for ETH3D
+ The model used for the submission on ETH3D is trained with the same command but using an unbounded Laplacian loss.
+
+ # Download the model
+ bash stereoflow/download_model.sh crocostereo_eth3d.pth
+ # ETH3D submission
+ python stereoflow/test.py --model stereoflow_models/crocostereo_eth3d.pth --dataset "ETH3DLowRes('all')" --save submission --tile_overlap 0.9
+ # Training command that was used
+ python -u stereoflow/train.py stereo --criterion "LaplacianLoss()" --tile_conf_mode conf_expbeta3 --dataset "CREStereo('train')+SceneFlow('train_allpass')+30*ETH3DLowRes('train')+50*Md05('train')+50*Md06('train')+50*Md14('train')+50*Md21('train')+50*MdEval3('train_full')+Booster('train_balanced')" --val_dataset "SceneFlow('test1of100_finalpass')+SceneFlow('test1of100_cleanpass')+ETH3DLowRes('subval')+Md05('subval')+Md06('subval')+Md14('subval')+Md21('subval')+MdEval3('subval_full')+Booster('subval_balanced')" --lr 3e-5 --batch_size 6 --epochs 32 --pretrained pretrained_models/CroCo_V2_ViTLarge_BaseDecoder.pth --output_dir xps/crocostereo/main_eth3d/
+
+
+
+
+ Main model finetuned on Kitti
+
+ # Download the model
+ bash stereoflow/download_model.sh crocostereo_finetune_kitti.pth
+ # Kitti submission
+ python stereoflow/test.py --model stereoflow_models/crocostereo_finetune_kitti.pth --dataset "Kitti15('test')" --save submission --tile_overlap 0.9
+ # Training command that was used
+ python -u stereoflow/train.py stereo --crop 352 1216 --criterion "LaplacianLossBounded2()" --dataset "Kitti12('train')+Kitti15('train')" --lr 3e-5 --batch_size 1 --accum_iter 6 --epochs 20 --pretrained pretrained_models/CroCo_V2_ViTLarge_BaseDecoder.pth --start_from stereoflow_models/crocostereo.pth --output_dir xps/crocostereo/finetune_kitti/ --save_every 5
+
+
+
+ Main model finetuned on Spring
+
+ # Download the model
+ bash stereoflow/download_model.sh crocostereo_finetune_spring.pth
+ # Spring submission
+ python stereoflow/test.py --model stereoflow_models/crocostereo_finetune_spring.pth --dataset "Spring('test')" --save submission --tile_overlap 0.9
+ # Training command that was used
+ python -u stereoflow/train.py stereo --criterion "LaplacianLossBounded2()" --dataset "Spring('train')" --lr 3e-5 --batch_size 6 --epochs 8 --pretrained pretrained_models/CroCo_V2_ViTLarge_BaseDecoder.pth --start_from stereoflow_models/crocostereo.pth --output_dir xps/crocostereo/finetune_spring/
+
+
+
+ Smaller models
+ To train CroCo-Stereo with smaller CroCo pretrained models, simply replace the `--pretrained` argument. To download the smaller CroCo-Stereo model based on CroCo v2 pretraining with a ViT-Base encoder and a Small decoder, use `bash stereoflow/download_model.sh crocostereo_subtrain_vitb_smalldecoder.pth`, and for the model with a ViT-Base encoder and a Base decoder, use `bash stereoflow/download_model.sh crocostereo_subtrain_vitb_basedecoder.pth`.
+
+
+
+### CroCo-Flow
+
+##### Main model
+
+The main training of CroCo-Flow was performed on the FlyingThings, FlyingChairs, MPI-Sintel and TartanAir datasets.
+It was used for our submission to the MPI-Sintel benchmark.
+
+```
+# Download the model
+bash stereoflow/download_model.sh crocoflow.pth
+# Evaluation
+python stereoflow/test.py --model stereoflow_models/crocoflow.pth --dataset "MPISintel('subval_cleanpass')+MPISintel('subval_finalpass')" --save metrics --tile_overlap 0.9
+# Sintel submission
+python stereoflow/test.py --model stereoflow_models/crocoflow.pth --dataset "MPISintel('test_allpass')" --save submission --tile_overlap 0.9
+# Training command that was used, with checkpoint-best.pth
+python -u stereoflow/train.py flow --criterion "LaplacianLossBounded()" --dataset "40*MPISintel('subtrain_cleanpass')+40*MPISintel('subtrain_finalpass')+4*FlyingThings('train_allpass')+4*FlyingChairs('train')+TartanAir('train')" --val_dataset "MPISintel('subval_cleanpass')+MPISintel('subval_finalpass')" --lr 2e-5 --batch_size 8 --epochs 240 --img_per_epoch 30000 --pretrained pretrained_models/CroCo_V2_ViTLarge_BaseDecoder.pth --output_dir xps/crocoflow/main/
+```
+
+##### Other models
+
+
+ Main model finetuned on Kitti
+
+ # Download the model
+ bash stereoflow/download_model.sh crocoflow_finetune_kitti.pth
+ # Kitti submission
+ python stereoflow/test.py --model stereoflow_models/crocoflow_finetune_kitti.pth --dataset "Kitti15('test')" --save submission --tile_overlap 0.99
+ # Training command that was used, with checkpoint-last.pth
+ python -u stereoflow/train.py flow --crop 352 1216 --criterion "LaplacianLossBounded()" --dataset "Kitti15('train')+Kitti12('train')" --lr 2e-5 --batch_size 1 --accum_iter 8 --epochs 150 --save_every 5 --pretrained pretrained_models/CroCo_V2_ViTLarge_BaseDecoder.pth --start_from stereoflow_models/crocoflow.pth --output_dir xps/crocoflow/finetune_kitti/
+
+
+
+ Main model finetuned on Spring
+
+ # Download the model
+ bash stereoflow/download_model.sh crocoflow_finetune_spring.pth
+ # Spring submission
+ python stereoflow/test.py --model stereoflow_models/crocoflow_finetune_spring.pth --dataset "Spring('test')" --save submission --tile_overlap 0.9
+ # Training command that was used, with checkpoint-last.pth
+ python -u stereoflow/train.py flow --criterion "LaplacianLossBounded()" --dataset "Spring('train')" --lr 2e-5 --batch_size 8 --epochs 12 --pretrained pretrained_models/CroCo_V2_ViTLarge_BaseDecoder.pth --start_from stereoflow_models/crocoflow.pth --output_dir xps/crocoflow/finetune_spring/
+
+
+
+ Smaller models
+ To train CroCo-Flow with smaller CroCo pretrained models, simply replace the `--pretrained` argument. To download the smaller CroCo-Flow model based on CroCo v2 pretraining with a ViT-Base encoder and a Small decoder, use `bash stereoflow/download_model.sh crocoflow_vitb_smalldecoder.pth`, and for the model with a ViT-Base encoder and a Base decoder, use `bash stereoflow/download_model.sh crocoflow_vitb_basedecoder.pth`.
+
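
Note on the `--dataset` strings in the README above: expressions such as `30*ETH3DLowRes('train')+50*Md05('train')` read as weighted mixtures, where a leading integer repeats a dataset and `+` concatenates datasets. One way such expressions can be supported is operator overloading. The sketch below is an assumption made for illustration; `ToyDataset` is a hypothetical class and is not taken from the `stereoflow` code.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset: it only stores sample identifiers."""

    def __init__(self, samples):
        self.samples = list(samples)

    @classmethod
    def named(cls, name, size):
        # Build `size` placeholder sample ids such as "eth3d/0", "eth3d/1", ...
        return cls(f"{name}/{i}" for i in range(size))

    def __len__(self):
        return len(self.samples)

    def __rmul__(self, factor):
        # `30 * dataset` repeats every sample 30 times (oversampling).
        return ToyDataset(self.samples * int(factor))

    def __add__(self, other):
        # `a + b` concatenates the two sample lists.
        return ToyDataset(self.samples + other.samples)


# A small dataset is oversampled so it is not drowned out by a large one.
mixed = 30 * ToyDataset.named("eth3d", 2) + ToyDataset.named("sceneflow", 100)
print(len(mixed))  # 160
```

Oversampling small real-world datasets in this way keeps them visible next to much larger synthetic ones during an epoch.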
diff --git a/third_party/dust3r/croco/stereoflow/augmentor.py b/third_party/dust3r/croco/stereoflow/augmentor.py
new file mode 100644
index 0000000000000000000000000000000000000000..c418525739bf61f6395c087dcbbb57302ea7c0c0
--- /dev/null
+++ b/third_party/dust3r/croco/stereoflow/augmentor.py
@@ -0,0 +1,388 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+# --------------------------------------------------------
+# Data augmentation for training stereo and flow
+# --------------------------------------------------------
+
+# References
+# https://github.com/autonomousvision/unimatch/blob/master/dataloader/stereo/transforms.py
+# https://github.com/autonomousvision/unimatch/blob/master/dataloader/flow/transforms.py
+
+
+import random
+
+import cv2
+import numpy as np
+from PIL import Image
+
+cv2.setNumThreads(0)
+cv2.ocl.setUseOpenCL(False)
+
+import torch
+import torchvision.transforms.functional as FF
+from torchvision.transforms import ColorJitter
+
+
+class StereoAugmentor(object):
+ def __init__(
+ self,
+ crop_size,
+ scale_prob=0.5,
+ scale_xonly=True,
+ lhth=800.0,
+ lminscale=0.0,
+ lmaxscale=1.0,
+ hminscale=-0.2,
+ hmaxscale=0.4,
+ scale_interp_nearest=True,
+ rightjitterprob=0.5,
+ v_flip_prob=0.5,
+ color_aug_asym=True,
+ color_choice_prob=0.5,
+ ):
+ self.crop_size = crop_size
+ self.scale_prob = scale_prob
+ self.scale_xonly = scale_xonly
+ self.lhth = lhth
+ self.lminscale = lminscale
+ self.lmaxscale = lmaxscale
+ self.hminscale = hminscale
+ self.hmaxscale = hmaxscale
+ self.scale_interp_nearest = scale_interp_nearest
+ self.rightjitterprob = rightjitterprob
+ self.v_flip_prob = v_flip_prob
+ self.color_aug_asym = color_aug_asym
+ self.color_choice_prob = color_choice_prob
+
+ def _random_scale(self, img1, img2, disp):
+ ch, cw = self.crop_size
+ h, w = img1.shape[:2]
+ if self.scale_prob > 0.0 and np.random.rand() < self.scale_prob:
+ min_scale, max_scale = (
+ (self.lminscale, self.lmaxscale)
+ if min(h, w) < self.lhth
+ else (self.hminscale, self.hmaxscale)
+ )
+ scale_x = 2.0 ** np.random.uniform(min_scale, max_scale)
+ scale_x = np.clip(scale_x, (cw + 8) / float(w), None)
+ scale_y = 1.0
+ if not self.scale_xonly:
+ scale_y = scale_x
+ scale_y = np.clip(scale_y, (ch + 8) / float(h), None)
+ img1 = cv2.resize(
+ img1, None, fx=scale_x, fy=scale_y, interpolation=cv2.INTER_LINEAR
+ )
+ img2 = cv2.resize(
+ img2, None, fx=scale_x, fy=scale_y, interpolation=cv2.INTER_LINEAR
+ )
+ disp = (
+ cv2.resize(
+ disp,
+ None,
+ fx=scale_x,
+ fy=scale_y,
+ interpolation=cv2.INTER_LINEAR
+ if not self.scale_interp_nearest
+ else cv2.INTER_NEAREST,
+ )
+ * scale_x
+ )
+ else: # check if we need to resize to be able to crop
+ h, w = img1.shape[:2]
+ clip_scale = (cw + 8) / float(w)
+ if clip_scale > 1.0:
+ scale_x = clip_scale
+ scale_y = scale_x if not self.scale_xonly else 1.0
+ img1 = cv2.resize(
+ img1, None, fx=scale_x, fy=scale_y, interpolation=cv2.INTER_LINEAR
+ )
+ img2 = cv2.resize(
+ img2, None, fx=scale_x, fy=scale_y, interpolation=cv2.INTER_LINEAR
+ )
+ disp = (
+ cv2.resize(
+ disp,
+ None,
+ fx=scale_x,
+ fy=scale_y,
+ interpolation=cv2.INTER_LINEAR
+ if not self.scale_interp_nearest
+ else cv2.INTER_NEAREST,
+ )
+ * scale_x
+ )
+ return img1, img2, disp
+
+ def _random_crop(self, img1, img2, disp):
+ h, w = img1.shape[:2]
+ ch, cw = self.crop_size
+ assert ch <= h and cw <= w, (img1.shape, h, w, ch, cw)
+ offset_x = np.random.randint(w - cw + 1)
+ offset_y = np.random.randint(h - ch + 1)
+ img1 = img1[offset_y : offset_y + ch, offset_x : offset_x + cw]
+ img2 = img2[offset_y : offset_y + ch, offset_x : offset_x + cw]
+ disp = disp[offset_y : offset_y + ch, offset_x : offset_x + cw]
+ return img1, img2, disp
+
+ def _random_vflip(self, img1, img2, disp):
+ # vertical flip
+ if self.v_flip_prob > 0 and np.random.rand() < self.v_flip_prob:
+ img1 = np.copy(np.flipud(img1))
+ img2 = np.copy(np.flipud(img2))
+ disp = np.copy(np.flipud(disp))
+ return img1, img2, disp
+
+ def _random_rotate_shift_right(self, img2):
+ if self.rightjitterprob > 0.0 and np.random.rand() < self.rightjitterprob:
+ angle, pixel = 0.1, 2
+ px = np.random.uniform(-pixel, pixel)
+ ag = np.random.uniform(-angle, angle)
+ image_center = (
+ np.random.uniform(0, img2.shape[0]),
+ np.random.uniform(0, img2.shape[1]),
+ )
+ rot_mat = cv2.getRotationMatrix2D(image_center, ag, 1.0)
+ img2 = cv2.warpAffine(
+ img2, rot_mat, img2.shape[1::-1], flags=cv2.INTER_LINEAR
+ )
+ trans_mat = np.float32([[1, 0, 0], [0, 1, px]])
+ img2 = cv2.warpAffine(
+ img2, trans_mat, img2.shape[1::-1], flags=cv2.INTER_LINEAR
+ )
+ return img2
+
+ def _random_color_contrast(self, img1, img2):
+ if np.random.random() < 0.5:
+ contrast_factor = np.random.uniform(0.8, 1.2)
+ img1 = FF.adjust_contrast(img1, contrast_factor)
+ if self.color_aug_asym and np.random.random() < 0.5:
+ contrast_factor = np.random.uniform(0.8, 1.2)
+ img2 = FF.adjust_contrast(img2, contrast_factor)
+ return img1, img2
+
+ def _random_color_gamma(self, img1, img2):
+ if np.random.random() < 0.5:
+ gamma = np.random.uniform(0.7, 1.5)
+ img1 = FF.adjust_gamma(img1, gamma)
+ if self.color_aug_asym and np.random.random() < 0.5:
+ gamma = np.random.uniform(0.7, 1.5)
+ img2 = FF.adjust_gamma(img2, gamma)
+ return img1, img2
+
+ def _random_color_brightness(self, img1, img2):
+ if np.random.random() < 0.5:
+ brightness = np.random.uniform(0.5, 2.0)
+ img1 = FF.adjust_brightness(img1, brightness)
+ if self.color_aug_asym and np.random.random() < 0.5:
+ brightness = np.random.uniform(0.5, 2.0)
+ img2 = FF.adjust_brightness(img2, brightness)
+ return img1, img2
+
+ def _random_color_hue(self, img1, img2):
+ if np.random.random() < 0.5:
+ hue = np.random.uniform(-0.1, 0.1)
+ img1 = FF.adjust_hue(img1, hue)
+ if self.color_aug_asym and np.random.random() < 0.5:
+ hue = np.random.uniform(-0.1, 0.1)
+ img2 = FF.adjust_hue(img2, hue)
+ return img1, img2
+
+ def _random_color_saturation(self, img1, img2):
+ if np.random.random() < 0.5:
+ saturation = np.random.uniform(0.8, 1.2)
+ img1 = FF.adjust_saturation(img1, saturation)
+ if self.color_aug_asym and np.random.random() < 0.5:
+ saturation = np.random.uniform(0.8, 1.2)
+ img2 = FF.adjust_saturation(img2, saturation)
+ return img1, img2
+
+ def _random_color(self, img1, img2):
+ trfs = [
+ self._random_color_contrast,
+ self._random_color_gamma,
+ self._random_color_brightness,
+ self._random_color_hue,
+ self._random_color_saturation,
+ ]
+ img1 = Image.fromarray(img1.astype("uint8"))
+ img2 = Image.fromarray(img2.astype("uint8"))
+ if np.random.random() < self.color_choice_prob:
+ # A single transform
+ t = random.choice(trfs)
+ img1, img2 = t(img1, img2)
+ else:
+ # Combination of trfs
+ # Random order
+ random.shuffle(trfs)
+ for t in trfs:
+ img1, img2 = t(img1, img2)
+ img1 = np.array(img1).astype(np.float32)
+ img2 = np.array(img2).astype(np.float32)
+ return img1, img2
+
+ def __call__(self, img1, img2, disp, dataset_name):
+ img1, img2, disp = self._random_scale(img1, img2, disp)
+ img1, img2, disp = self._random_crop(img1, img2, disp)
+ img1, img2, disp = self._random_vflip(img1, img2, disp)
+ img2 = self._random_rotate_shift_right(img2)
+ img1, img2 = self._random_color(img1, img2)
+ return img1, img2, disp
+
+
+class FlowAugmentor:
+ def __init__(
+ self,
+ crop_size,
+ min_scale=-0.2,
+ max_scale=0.5,
+ spatial_aug_prob=0.8,
+ stretch_prob=0.8,
+ max_stretch=0.2,
+ h_flip_prob=0.5,
+ v_flip_prob=0.1,
+ asymmetric_color_aug_prob=0.2,
+ ):
+ # spatial augmentation params
+ self.crop_size = crop_size
+ self.min_scale = min_scale
+ self.max_scale = max_scale
+ self.spatial_aug_prob = spatial_aug_prob
+ self.stretch_prob = stretch_prob
+ self.max_stretch = max_stretch
+
+ # flip augmentation params
+ self.h_flip_prob = h_flip_prob
+ self.v_flip_prob = v_flip_prob
+
+ # photometric augmentation params
+ self.photo_aug = ColorJitter(
+ brightness=0.4, contrast=0.4, saturation=0.4, hue=0.5 / 3.14
+ )
+
+ self.asymmetric_color_aug_prob = asymmetric_color_aug_prob
+
+ def color_transform(self, img1, img2):
+ """Photometric augmentation"""
+
+ # asymmetric
+ if np.random.rand() < self.asymmetric_color_aug_prob:
+ img1 = np.array(self.photo_aug(Image.fromarray(img1)), dtype=np.uint8)
+ img2 = np.array(self.photo_aug(Image.fromarray(img2)), dtype=np.uint8)
+
+ # symmetric
+ else:
+ image_stack = np.concatenate([img1, img2], axis=0)
+ image_stack = np.array(
+ self.photo_aug(Image.fromarray(image_stack)), dtype=np.uint8
+ )
+ img1, img2 = np.split(image_stack, 2, axis=0)
+
+ return img1, img2
+
+ def _resize_flow(self, flow, scale_x, scale_y, factor=1.0):
+ if np.all(np.isfinite(flow)):
+ flow = cv2.resize(
+ flow,
+ None,
+ fx=scale_x / factor,
+ fy=scale_y / factor,
+ interpolation=cv2.INTER_LINEAR,
+ )
+ flow = flow * [scale_x, scale_y]
+ else: # sparse version
+ fx, fy = scale_x, scale_y
+ ht, wd = flow.shape[:2]
+ coords = np.meshgrid(np.arange(wd), np.arange(ht))
+ coords = np.stack(coords, axis=-1)
+
+ coords = coords.reshape(-1, 2).astype(np.float32)
+ flow = flow.reshape(-1, 2).astype(np.float32)
+ valid = np.isfinite(flow[:, 0])
+
+ coords0 = coords[valid]
+ flow0 = flow[valid]
+
+ ht1 = int(round(ht * fy / factor))
+ wd1 = int(round(wd * fx / factor))
+
+ rescale = np.expand_dims(np.array([fx, fy]), axis=0)
+ coords1 = coords0 * rescale / factor
+ flow1 = flow0 * rescale
+
+ xx = np.round(coords1[:, 0]).astype(np.int32)
+ yy = np.round(coords1[:, 1]).astype(np.int32)
+
+ v = (xx > 0) & (xx < wd1) & (yy > 0) & (yy < ht1)
+ xx = xx[v]
+ yy = yy[v]
+ flow1 = flow1[v]
+
+ flow = np.inf * np.ones(
+ [ht1, wd1, 2], dtype=np.float32
+ ) # invalid value everywhere, before we fill it with the correct ones
+ flow[yy, xx] = flow1
+ return flow
+
+ def spatial_transform(self, img1, img2, flow, dname):
+ if np.random.rand() < self.spatial_aug_prob:
+ # randomly sample scale
+ ht, wd = img1.shape[:2]
+ clip_min_scale = np.maximum(
+ (self.crop_size[0] + 8) / float(ht), (self.crop_size[1] + 8) / float(wd)
+ )
+ min_scale, max_scale = self.min_scale, self.max_scale
+ scale = 2 ** np.random.uniform(self.min_scale, self.max_scale)
+ scale_x = scale
+ scale_y = scale
+ if np.random.rand() < self.stretch_prob:
+ scale_x *= 2 ** np.random.uniform(-self.max_stretch, self.max_stretch)
+ scale_y *= 2 ** np.random.uniform(-self.max_stretch, self.max_stretch)
+ scale_x = np.clip(scale_x, clip_min_scale, None)
+ scale_y = np.clip(scale_y, clip_min_scale, None)
+ # rescale the images
+ img1 = cv2.resize(
+ img1, None, fx=scale_x, fy=scale_y, interpolation=cv2.INTER_LINEAR
+ )
+ img2 = cv2.resize(
+ img2, None, fx=scale_x, fy=scale_y, interpolation=cv2.INTER_LINEAR
+ )
+ flow = self._resize_flow(
+ flow, scale_x, scale_y, factor=2.0 if dname == "Spring" else 1.0
+ )
+ elif dname == "Spring":
+ flow = self._resize_flow(flow, 1.0, 1.0, factor=2.0)
+
+ if self.h_flip_prob > 0.0 and np.random.rand() < self.h_flip_prob: # h-flip
+ img1 = img1[:, ::-1]
+ img2 = img2[:, ::-1]
+ flow = flow[:, ::-1] * [-1.0, 1.0]
+
+ if self.v_flip_prob > 0.0 and np.random.rand() < self.v_flip_prob: # v-flip
+ img1 = img1[::-1, :]
+ img2 = img2[::-1, :]
+ flow = flow[::-1, :] * [1.0, -1.0]
+
+ # In case no cropping
+ if img1.shape[0] - self.crop_size[0] > 0:
+ y0 = np.random.randint(0, img1.shape[0] - self.crop_size[0])
+ else:
+ y0 = 0
+ if img1.shape[1] - self.crop_size[1] > 0:
+ x0 = np.random.randint(0, img1.shape[1] - self.crop_size[1])
+ else:
+ x0 = 0
+
+ img1 = img1[y0 : y0 + self.crop_size[0], x0 : x0 + self.crop_size[1]]
+ img2 = img2[y0 : y0 + self.crop_size[0], x0 : x0 + self.crop_size[1]]
+ flow = flow[y0 : y0 + self.crop_size[0], x0 : x0 + self.crop_size[1]]
+
+ return img1, img2, flow
+
+ def __call__(self, img1, img2, flow, dname):
+ img1, img2, flow = self.spatial_transform(img1, img2, flow, dname)
+ img1, img2 = self.color_transform(img1, img2)
+ img1 = np.ascontiguousarray(img1)
+ img2 = np.ascontiguousarray(img2)
+ flow = np.ascontiguousarray(flow)
+ return img1, img2, flow
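
Note on the geometric augmentations in the file above: when the images are transformed, the ground truth has to be transformed consistently. `_random_scale` multiplies the disparity by `scale_x` after a horizontal resize, and the h-flip branch of `spatial_transform` negates the x component of the flow. A minimal NumPy-only sketch of both rules follows. The function names are hypothetical, and integer nearest-neighbour upscaling stands in for `cv2.resize` to keep the example dependency-light.

```python
import numpy as np


def hflip_flow(flow):
    """Mirror an HxWx2 flow field left-right and negate its x component."""
    return flow[:, ::-1] * np.array([-1.0, 1.0])


def upscale_disparity_x(disp, factor):
    """Nearest-neighbour horizontal upscaling by an integer factor.

    Disparities are measured in pixels along x, so they grow with the width.
    """
    return np.repeat(disp, factor, axis=1) * factor


# One pixel at x=0 moving 2 px right and 1 px down.
flow = np.zeros((1, 3, 2))
flow[0, 0] = [2.0, 1.0]
flipped = hflip_flow(flow)
print(flipped[0, 2])  # the pixel is now at x=2 and moves 2 px left, 1 px down

disp = np.array([[1.0, 2.0]])
print(upscale_disparity_x(disp, 2))  # [[2. 2. 4. 4.]]
```

A vertical flip follows the same pattern with the y component negated instead.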
diff --git a/third_party/dust3r/croco/stereoflow/criterion.py b/third_party/dust3r/croco/stereoflow/criterion.py
new file mode 100644
index 0000000000000000000000000000000000000000..4ce56f6e10a63185b325730eca151076ae7e47a2
--- /dev/null
+++ b/third_party/dust3r/croco/stereoflow/criterion.py
@@ -0,0 +1,346 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+# --------------------------------------------------------
+# Losses, metrics per batch, metrics per dataset
+# --------------------------------------------------------
+
+import torch
+import torch.nn.functional as F
+from torch import nn
+
+
+def _get_gtnorm(gt):
+ if gt.size(1) == 1: # stereo
+ return gt
+ # flow
+ return torch.sqrt(torch.sum(gt**2, dim=1, keepdims=True)) # Bx1xHxW
+
+
+############ losses without confidence
+
+
+class L1Loss(nn.Module):
+ def __init__(self, max_gtnorm=None):
+ super().__init__()
+ self.max_gtnorm = max_gtnorm
+ self.with_conf = False
+
+ def _error(self, gt, predictions):
+ return torch.abs(gt - predictions)
+
+ def forward(self, predictions, gt, inspect=False):
+ mask = torch.isfinite(gt)
+ if self.max_gtnorm is not None:
+ mask *= _get_gtnorm(gt).expand(-1, gt.size(1), -1, -1) < self.max_gtnorm
+ if inspect:
+ return self._error(gt, predictions)
+ return self._error(gt[mask], predictions[mask]).mean()
+
+
+ ############## losses with confidence
+## there are several parametrizations
+
+
+class LaplacianLoss(nn.Module): # used for CroCo-Stereo on ETH3D, d'=exp(d)
+ def __init__(self, max_gtnorm=None):
+ super().__init__()
+ self.max_gtnorm = max_gtnorm
+ self.with_conf = True
+
+ def forward(self, predictions, gt, conf):
+ mask = torch.isfinite(gt)
+ mask = mask[:, 0, :, :]
+ if self.max_gtnorm is not None:
+ mask *= _get_gtnorm(gt)[:, 0, :, :] < self.max_gtnorm
+ conf = conf.squeeze(1)
+ return (
+ torch.abs(gt - predictions).sum(dim=1)[mask] / torch.exp(conf[mask])
+ + conf[mask]
+ ).mean() # + torch.log(2) => which is a constant
+
+
+class LaplacianLossBounded(
+ nn.Module
+): # used for CroCo-Flow ; in the equation of the paper, we have a=1/b
+ def __init__(self, max_gtnorm=10000.0, a=0.25, b=4.0):
+ super().__init__()
+ self.max_gtnorm = max_gtnorm
+ self.with_conf = True
+ self.a, self.b = a, b
+
+ def forward(self, predictions, gt, conf):
+ mask = torch.isfinite(gt)
+ mask = mask[:, 0, :, :]
+ if self.max_gtnorm is not None:
+ mask *= _get_gtnorm(gt)[:, 0, :, :] < self.max_gtnorm
+ conf = conf.squeeze(1)
+ conf = (self.b - self.a) * torch.sigmoid(conf) + self.a
+ return (
+ torch.abs(gt - predictions).sum(dim=1)[mask] / conf[mask]
+ + torch.log(conf)[mask]
+ ).mean() # + torch.log(2) => which is a constant
+
+
+class LaplacianLossBounded2(
+ nn.Module
+): # used for CroCo-Stereo (except for ETH3D) ; in the equation of the paper, we have a=b
+ def __init__(self, max_gtnorm=None, a=3.0, b=3.0):
+ super().__init__()
+ self.max_gtnorm = max_gtnorm
+ self.with_conf = True
+ self.a, self.b = a, b
+
+ def forward(self, predictions, gt, conf):
+ mask = torch.isfinite(gt)
+ mask = mask[:, 0, :, :]
+ if self.max_gtnorm is not None:
+ mask *= _get_gtnorm(gt)[:, 0, :, :] < self.max_gtnorm
+ conf = conf.squeeze(1)
+ conf = 2 * self.a * (torch.sigmoid(conf / self.b) - 0.5)
+ return (
+ torch.abs(gt - predictions).sum(dim=1)[mask] / torch.exp(conf[mask])
+ + conf[mask]
+ ).mean() # + torch.log(2) => which is a constant
+
+
+############## metrics per batch
+
+
+class StereoMetrics(nn.Module):
+ def __init__(self, do_quantile=False):
+ super().__init__()
+ self.bad_ths = [0.5, 1, 2, 3]
+ self.do_quantile = do_quantile
+
+ def forward(self, predictions, gt):
+ B = predictions.size(0)
+ metrics = {}
+ gtcopy = gt.clone()
+ mask = torch.isfinite(gtcopy)
+ gtcopy[
+ ~mask
+ ] = 999999.0 # work on a copy and write a finite value, so that the error does not become nan once multiplied by the mask value 0
+ Npx = mask.view(B, -1).sum(dim=1)
+ L1error = (torch.abs(gtcopy - predictions) * mask).view(B, -1)
+ L2error = (torch.square(gtcopy - predictions) * mask).view(B, -1)
+ # avgerr
+ metrics["avgerr"] = torch.mean(L1error.sum(dim=1) / Npx)
+ # rmse
+ metrics["rmse"] = torch.sqrt(L2error.sum(dim=1) / Npx).mean(dim=0)
+ # err > t for t in [0.5,1,2,3]
+ for ths in self.bad_ths:
+ metrics["bad@{:.1f}".format(ths)] = (
+ ((L1error > ths) * mask.view(B, -1)).sum(dim=1) / Npx
+ ).mean(dim=0) * 100
+ return metrics
+
+
+class FlowMetrics(nn.Module):
+ def __init__(self):
+ super().__init__()
+ self.bad_ths = [1, 3, 5]
+
+ def forward(self, predictions, gt):
+ B = predictions.size(0)
+ metrics = {}
+ mask = torch.isfinite(gt[:, 0, :, :]) # both x and y would be infinite
+ Npx = mask.view(B, -1).sum(dim=1)
+ gtcopy = (
+ gt.clone()
+ ) # to compute the L1/L2 errors we need finite values; the errors computed at these locations will be ignored
+ gtcopy[:, 0, :, :][~mask] = 999999.0
+ gtcopy[:, 1, :, :][~mask] = 999999.0
+ L1error = (torch.abs(gtcopy - predictions).sum(dim=1) * mask).view(B, -1)
+ L2error = (
+ torch.sqrt(torch.sum(torch.square(gtcopy - predictions), dim=1)) * mask
+ ).view(B, -1)
+ metrics["L1err"] = torch.mean(L1error.sum(dim=1) / Npx)
+ metrics["EPE"] = torch.mean(L2error.sum(dim=1) / Npx)
+ for ths in self.bad_ths:
+ metrics["bad@{:.1f}".format(ths)] = (
+ ((L2error > ths) * mask.view(B, -1)).sum(dim=1) / Npx
+ ).mean(dim=0) * 100
+ return metrics
+
+
+############## metrics per dataset
+## we update the average and maintain the number of pixels while adding data batch per batch
+ ## at the beginning, call reset()
+## after each batch, call add_batch(...)
+## at the end: call get_results()
+
+
+class StereoDatasetMetrics(nn.Module):
+ def __init__(self):
+ super().__init__()
+ self.bad_ths = [0.5, 1, 2, 3]
+
+ def reset(self):
+ self.agg_N = 0 # number of pixels so far
+ self.agg_L1err = torch.tensor(0.0) # L1 error so far
+ self.agg_Nbad = [0 for _ in self.bad_ths] # counter of bad pixels
+ self._metrics = None
+
+ def add_batch(self, predictions, gt):
+ assert predictions.size(1) == 1, predictions.size()
+ assert gt.size(1) == 1, gt.size()
+ if (
+ gt.size(2) == predictions.size(2) * 2
+ and gt.size(3) == predictions.size(3) * 2
+ ): # special case for Spring ...
+ L1err = torch.minimum(
+ torch.minimum(
+ torch.minimum(
+ torch.sum(torch.abs(gt[:, :, 0::2, 0::2] - predictions), dim=1),
+ torch.sum(torch.abs(gt[:, :, 1::2, 0::2] - predictions), dim=1),
+ ),
+ torch.sum(torch.abs(gt[:, :, 0::2, 1::2] - predictions), dim=1),
+ ),
+ torch.sum(torch.abs(gt[:, :, 1::2, 1::2] - predictions), dim=1),
+ )
+ valid = torch.isfinite(L1err)
+ else:
+ valid = torch.isfinite(gt[:, 0, :, :]) # single disparity channel
+ L1err = torch.sum(torch.abs(gt - predictions), dim=1)
+ N = valid.sum()
+ Nnew = self.agg_N + N
+ self.agg_L1err = (
+ float(self.agg_N) / Nnew * self.agg_L1err
+ + L1err[valid].mean().cpu() * float(N) / Nnew
+ )
+ self.agg_N = Nnew
+ for i, th in enumerate(self.bad_ths):
+ self.agg_Nbad[i] += (L1err[valid] > th).sum().cpu()
+
+ def _compute_metrics(self):
+ if self._metrics is not None:
+ return
+ out = {}
+ out["L1err"] = self.agg_L1err.item()
+ for i, th in enumerate(self.bad_ths):
+ out["bad@{:.1f}".format(th)] = (
+ float(self.agg_Nbad[i]) / self.agg_N
+ ).item() * 100.0
+ self._metrics = out
+
+ def get_results(self):
+ self._compute_metrics() # cached, to avoid recomputing them multiple times
+ return self._metrics
+
+
+class FlowDatasetMetrics(nn.Module):
+ def __init__(self):
+ super().__init__()
+ self.bad_ths = [0.5, 1, 3, 5]
+ self.speed_ths = [(0, 10), (10, 40), (40, torch.inf)]
+
+ def reset(self):
+ self.agg_N = 0 # number of pixels so far
+ self.agg_L1err = torch.tensor(0.0) # L1 error so far
+ self.agg_L2err = torch.tensor(0.0) # L2 (=EPE) error so far
+ self.agg_Nbad = [0 for _ in self.bad_ths] # counter of bad pixels
+ self.agg_EPEspeed = [
+ torch.tensor(0.0) for _ in self.speed_ths
+ ] # EPE per speed bin so far
+ self.agg_Nspeed = [0 for _ in self.speed_ths] # N pixels per speed bin so far
+ self._metrics = None
+ self.pairname_results = {}
+
+ def add_batch(self, predictions, gt):
+ assert predictions.size(1) == 2, predictions.size()
+ assert gt.size(1) == 2, gt.size()
+ if (
+ gt.size(2) == predictions.size(2) * 2
+ and gt.size(3) == predictions.size(3) * 2
+ ): # special case for Spring ...
+ L1err = torch.minimum(
+ torch.minimum(
+ torch.minimum(
+ torch.sum(torch.abs(gt[:, :, 0::2, 0::2] - predictions), dim=1),
+ torch.sum(torch.abs(gt[:, :, 1::2, 0::2] - predictions), dim=1),
+ ),
+ torch.sum(torch.abs(gt[:, :, 0::2, 1::2] - predictions), dim=1),
+ ),
+ torch.sum(torch.abs(gt[:, :, 1::2, 1::2] - predictions), dim=1),
+ )
+ L2err = torch.minimum(
+ torch.minimum(
+ torch.minimum(
+ torch.sqrt(
+ torch.sum(
+ torch.square(gt[:, :, 0::2, 0::2] - predictions), dim=1
+ )
+ ),
+ torch.sqrt(
+ torch.sum(
+ torch.square(gt[:, :, 1::2, 0::2] - predictions), dim=1
+ )
+ ),
+ ),
+ torch.sqrt(
+ torch.sum(
+ torch.square(gt[:, :, 0::2, 1::2] - predictions), dim=1
+ )
+ ),
+ ),
+ torch.sqrt(
+ torch.sum(torch.square(gt[:, :, 1::2, 1::2] - predictions), dim=1)
+ ),
+ )
+ valid = torch.isfinite(L1err)
+ gtspeed = (
+ torch.sqrt(torch.sum(torch.square(gt[:, :, 0::2, 0::2]), dim=1))
+ + torch.sqrt(torch.sum(torch.square(gt[:, :, 0::2, 1::2]), dim=1))
+ + torch.sqrt(torch.sum(torch.square(gt[:, :, 1::2, 0::2]), dim=1))
+ + torch.sqrt(torch.sum(torch.square(gt[:, :, 1::2, 1::2]), dim=1))
+ ) / 4.0 # let's just average them
+ else:
+ valid = torch.isfinite(gt[:, 0, :, :]) # both x and y would be infinite
+ L1err = torch.sum(torch.abs(gt - predictions), dim=1)
+ L2err = torch.sqrt(torch.sum(torch.square(gt - predictions), dim=1))
+ gtspeed = torch.sqrt(torch.sum(torch.square(gt), dim=1))
+ N = valid.sum()
+ Nnew = self.agg_N + N
+ self.agg_L1err = (
+ float(self.agg_N) / Nnew * self.agg_L1err
+ + L1err[valid].mean().cpu() * float(N) / Nnew
+ )
+ self.agg_L2err = (
+ float(self.agg_N) / Nnew * self.agg_L2err
+ + L2err[valid].mean().cpu() * float(N) / Nnew
+ )
+ self.agg_N = Nnew
+ for i, th in enumerate(self.bad_ths):
+ self.agg_Nbad[i] += (L2err[valid] > th).sum().cpu()
+ for i, (th1, th2) in enumerate(self.speed_ths):
+ vv = (gtspeed[valid] >= th1) * (gtspeed[valid] < th2)
+ iNspeed = vv.sum()
+ if iNspeed == 0:
+ continue
+ iNnew = self.agg_Nspeed[i] + iNspeed
+ self.agg_EPEspeed[i] = (
+ float(self.agg_Nspeed[i]) / iNnew * self.agg_EPEspeed[i]
+ + float(iNspeed) / iNnew * L2err[valid][vv].mean().cpu()
+ )
+ self.agg_Nspeed[i] = iNnew
+
+ def _compute_metrics(self):
+ if self._metrics is not None:
+ return
+ out = {}
+ out["L1err"] = self.agg_L1err.item()
+ out["EPE"] = self.agg_L2err.item()
+ for i, th in enumerate(self.bad_ths):
+ out["bad@{:.1f}".format(th)] = (
+ float(self.agg_Nbad[i]) / self.agg_N
+ ).item() * 100.0
+ for i, (th1, th2) in enumerate(self.speed_ths):
+ out[
+ "s{:d}{:s}".format(th1, "-" + str(th2) if th2 < torch.inf else "+")
+ ] = self.agg_EPEspeed[i].item()
+ self._metrics = out
+
+ def get_results(self):
+ self._compute_metrics() # cached, to avoid recomputing them multiple times
+ return self._metrics
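
Note on the per-dataset metrics above: `add_batch` keeps a pixel-weighted running mean instead of storing every per-pixel error. The sketch below illustrates that update rule in plain Python, without torch. The class name `RunningMean`, the list-based input, and the empty-batch guard are assumptions made for illustration and are not part of this patch.

```python
class RunningMean:
    """Pixel-weighted running mean, updated one batch at a time (illustrative only)."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.count = 0   # number of valid pixels seen so far
        self.mean = 0.0  # mean error over those pixels

    def add_batch(self, errors):
        n = len(errors)
        if n == 0:
            # Assumption: skip batches without valid pixels. The patch itself
            # has no such guard, so an all-invalid batch would yield nan there.
            return
        new_count = self.count + n
        batch_mean = sum(errors) / n
        # Same form as agg_L1err: old mean weighted by old count, plus batch
        # mean weighted by batch size, both divided by the new total count.
        self.mean = (self.count / new_count) * self.mean + (n / new_count) * batch_mean
        self.count = new_count

    def get_results(self):
        return self.mean


# Usage: two batches of different sizes; the result should match (up to
# floating-point rounding) the mean taken over all three values at once.
metric = RunningMean()
metric.add_batch([1.0, 3.0])
metric.add_batch([5.0])
print(metric.get_results())
```

Weighting by pixel count rather than averaging the per-batch means keeps the result independent of how the dataset is split into batches.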
diff --git a/third_party/dust3r/croco/stereoflow/datasets_flow.py b/third_party/dust3r/croco/stereoflow/datasets_flow.py
new file mode 100644
index 0000000000000000000000000000000000000000..322a745059f7d226f7ba9e60de8bcd7f18c794a4
--- /dev/null
+++ b/third_party/dust3r/croco/stereoflow/datasets_flow.py
@@ -0,0 +1,929 @@
+# Copyright (C) 2022-present Naver Corporation. All rights reserved.
+# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).
+
+# --------------------------------------------------------
+# Dataset structure for flow
+# --------------------------------------------------------
+
+import json
+import os
+import os.path as osp
+import pickle
+import struct
+from copy import deepcopy
+
+import h5py
+import numpy as np
+import torch
+from PIL import Image
+from torch.utils import data
+
+from .augmentor import FlowAugmentor
+from .datasets_stereo import _read_img, _read_pfm, dataset_to_root, img_to_tensor
+
+dataset_to_root = deepcopy(dataset_to_root)
+
+dataset_to_root.update(
+ **{
+ "TartanAir": "./data/stereoflow/TartanAir",
+ "FlyingChairs": "./data/stereoflow/FlyingChairs/",
+ "FlyingThings": osp.join(dataset_to_root["SceneFlow"], "FlyingThings") + "/",
+ "MPISintel": "./data/stereoflow/MPI-Sintel/",
+ }
+)
+cache_dir = "./data/stereoflow/datasets_flow_cache/"
+
+
+def flow_to_tensor(disp):
+ return torch.from_numpy(disp).float().permute(2, 0, 1)
+
+
+class FlowDataset(data.Dataset):
+ def __init__(self, split, augmentor=False, crop_size=None, totensor=True):
+ self.split = split
+ if not augmentor:
+ assert crop_size is None
+ if crop_size is not None:
+ assert augmentor
+ self.crop_size = crop_size
+ self.augmentor_str = augmentor
+ self.augmentor = FlowAugmentor(crop_size) if augmentor else None
+ self.totensor = totensor
+ self.rmul = 1 # keep track of rmul
+ self.has_constant_resolution = True # whether the dataset has constant resolution or not (=> don't use batch_size>1 at test time)
+ self._prepare_data()
+ self._load_or_build_cache()
+
+ def _prepare_data(self):
+ """
+ To be defined by each dataset subclass; called from __init__.
+ """
+ raise NotImplementedError
+
+ def __len__(self):
+ return len(
+ self.pairnames
+ ) # each pairname is typically of the form (str, int1, int2)
+
+ def __getitem__(self, index):
+ pairname = self.pairnames[index]
+
+ # get filenames
+ img1name = self.pairname_to_img1name(pairname)
+ img2name = self.pairname_to_img2name(pairname)
+ flowname = (
+ self.pairname_to_flowname(pairname)
+ if self.pairname_to_flowname is not None
+ else None
+ )
+
+ # load images and flow
+ img1 = _read_img(img1name)
+ img2 = _read_img(img2name)
+ flow = self.load_flow(flowname) if flowname is not None else None
+
+ # apply augmentations
+ if self.augmentor is not None:
+ img1, img2, flow = self.augmentor(img1, img2, flow, self.name)
+
+ if self.totensor:
+ img1 = img_to_tensor(img1)
+ img2 = img_to_tensor(img2)
+ if flow is not None:
+ flow = flow_to_tensor(flow)
+ else:
+ flow = torch.tensor(
+ []
+ ) # to allow dataloader batching with default collate_fn
+ pairname = str(
+ pairname
+ ) # transform potential tuple to str to be able to batch it
+
+ return img1, img2, flow, pairname
+
+ def __rmul__(self, v):
+ self.rmul *= v
+ self.pairnames = v * self.pairnames
+ return self
+
+ def __str__(self):
+ return f"{self.__class__.__name__}_{self.split}"
+
+ def __repr__(self):
+ s = f"{self.__class__.__name__}(split={self.split}, augmentor={self.augmentor_str}, crop_size={str(self.crop_size)}, totensor={self.totensor})"
+ if self.rmul == 1:
+ s += f"\n\tnum pairs: {len(self.pairnames)}"
+ else:
+ s += f"\n\tnum pairs: {len(self.pairnames)} ({len(self.pairnames)//self.rmul}x{self.rmul})"
+ return s
+
+ def _set_root(self):
+ self.root = dataset_to_root[self.name]
+ assert os.path.isdir(
+ self.root
+ ), f"could not find root directory for dataset {self.name}: {self.root}"
+
+ def _load_or_build_cache(self):
+ cache_file = osp.join(cache_dir, self.name + ".pkl")
+ if osp.isfile(cache_file):
+ with open(cache_file, "rb") as fid:
+ self.pairnames = pickle.load(fid)[self.split]
+ else:
+ tosave = self._build_cache()
+ os.makedirs(cache_dir, exist_ok=True)
+ with open(cache_file, "wb") as fid:
+ pickle.dump(tosave, fid)
+ self.pairnames = tosave[self.split]
+
+
+class TartanAirDataset(FlowDataset):
+ def _prepare_data(self):
+ self.name = "TartanAir"
+ self._set_root()
+ assert self.split in ["train"]
+ self.pairname_to_img1name = lambda pairname: osp.join(
+ self.root, pairname[0], "image_left/{:06d}_left.png".format(pairname[1])
+ )
+ self.pairname_to_img2name = lambda pairname: osp.join(
+ self.root, pairname[0], "image_left/{:06d}_left.png".format(pairname[2])
+ )
+ self.pairname_to_flowname = lambda pairname: osp.join(
+ self.root,
+ pairname[0],
+ "flow/{:06d}_{:06d}_flow.npy".format(pairname[1], pairname[2]),
+ )
+ self.pairname_to_str = lambda pairname: os.path.join(
+ pairname[0][pairname[0].find("/") + 1 :],
+ "{:06d}_{:06d}".format(pairname[1], pairname[2]),
+ )
+ self.load_flow = _read_numpy_flow
+
+ def _build_cache(self):
+ seqs = sorted(os.listdir(self.root))
+ pairs = [
+ (osp.join(s, s, difficulty, Pxxx), int(a[:6]), int(a[:6]) + 1)
+ for s in seqs
+ for difficulty in ["Easy", "Hard"]
+ for Pxxx in sorted(os.listdir(osp.join(self.root, s, s, difficulty)))
+ for a in sorted(
+ os.listdir(osp.join(self.root, s, s, difficulty, Pxxx, "image_left/"))
+ )[:-1]
+ ]
+ assert len(pairs) == 306268, "incorrect parsing of pairs in TartanAir"
+ tosave = {"train": pairs}
+ return tosave
+
+
+class FlyingChairsDataset(FlowDataset):
+ def _prepare_data(self):
+ self.name = "FlyingChairs"
+ self._set_root()
+ assert self.split in ["train", "val"]
+ self.pairname_to_img1name = lambda pairname: osp.join(
+ self.root, "data", pairname + "_img1.ppm"
+ )
+ self.pairname_to_img2name = lambda pairname: osp.join(
+ self.root, "data", pairname + "_img2.ppm"
+ )
+ self.pairname_to_flowname = lambda pairname: osp.join(
+ self.root, "data", pairname + "_flow.flo"
+ )
+ self.pairname_to_str = lambda pairname: pairname
+ self.load_flow = _read_flo_file
+
+ def _build_cache(self):
+ split_file = osp.join(self.root, "chairs_split.txt")
+ split_list = np.loadtxt(split_file, dtype=np.int32)
+ trainpairs = ["{:05d}".format(i) for i in np.where(split_list == 1)[0] + 1]
+ valpairs = ["{:05d}".format(i) for i in np.where(split_list == 2)[0] + 1]
+ assert (
+ len(trainpairs) == 22232 and len(valpairs) == 640
+ ), "incorrect parsing of pairs in FlyingChairs"
+ tosave = {"train": trainpairs, "val": valpairs}
+ return tosave
+
+
+class FlyingThingsDataset(FlowDataset):
+ def _prepare_data(self):
+ self.name = "FlyingThings"
+ self._set_root()
+ assert self.split in [
+ f"{set_}_{pass_}pass{camstr}"
+ for set_ in ["train", "test", "test1024"]
+ for camstr in ["", "_rightcam"]
+ for pass_ in ["clean", "final", "all"]
+ ]
+ self.pairname_to_img1name = lambda pairname: osp.join(
+ self.root,
+ f"frames_{pairname[3]}pass",
+ pairname[0].replace("into_future", "").replace("into_past", ""),
+ "{:04d}.png".format(pairname[1]),
+ )
+ self.pairname_to_img2name = lambda pairname: osp.join(
+ self.root,
+ f"frames_{pairname[3]}pass",
+ pairname[0].replace("into_future", "").replace("into_past", ""),
+ "{:04d}.png".format(pairname[2]),
+ )
+ self.pairname_to_flowname = lambda pairname: osp.join(
+ self.root,
+ "optical_flow",
+ pairname[0],
+ "OpticalFlowInto{f:s}_{i:04d}_{c:s}.pfm".format(
+ f="Future" if "future" in pairname[0] else "Past",
+ i=pairname[1],
+ c="L" if "left" in pairname[0] else "R",
+ ),
+ )
+ self.pairname_to_str = lambda pairname: os.path.join(
+ pairname[3] + "pass",
+ pairname[0],
+ "Into{f:s}_{i:04d}_{c:s}".format(
+ f="Future" if "future" in pairname[0] else "Past",
+ i=pairname[1],
+ c="L" if "left" in pairname[0] else "R",
+ ),
+ )
+ self.load_flow = _read_pfm_flow
+
+ def _build_cache(self):
+ tosave = {}
+ # train and test splits for the different passes
+ for set_ in ["train", "test"]:
+ sroot = osp.join(self.root, "optical_flow", set_.upper())
+ fname_to_i = lambda f: int(
+ f[len("OpticalFlowIntoFuture_") : -len("_L.pfm")]
+ )
+ pp = [
+ (osp.join(set_.upper(), d, s, "into_future/left"), fname_to_i(fname))
+ for d in sorted(os.listdir(sroot))
+ for s in sorted(os.listdir(osp.join(sroot, d)))
+ for fname in sorted(
+ os.listdir(osp.join(sroot, d, s, "into_future/left"))
+ )[:-1]
+ ]
+ pairs = [(a, i, i + 1) for a, i in pp]
+ pairs += [(a.replace("into_future", "into_past"), i + 1, i) for a, i in pp]
+ assert (
+ len(pairs) == {"train": 40302, "test": 7866}[set_]
+ ), "incorrect parsing of pairs in Flying Things"
+ for cam in ["left", "right"]:
+ camstr = "" if cam == "left" else f"_{cam}cam"
+ for pass_ in ["final", "clean"]:
+ tosave[f"{set_}_{pass_}pass{camstr}"] = [
+ (a.replace("left", cam), i, j, pass_) for a, i, j in pairs
+ ]
+ tosave[f"{set_}_allpass{camstr}"] = (
+ tosave[f"{set_}_cleanpass{camstr}"]
+ + tosave[f"{set_}_finalpass{camstr}"]
+ )
+ # test1024: this is the same split as unimatch 'validation' split
+ # see https://github.com/autonomousvision/unimatch/blob/master/dataloader/flow/datasets.py#L229
+ test1024_nsamples = 1024
+ alltest_nsamples = len(tosave["test_cleanpass"]) # 7866
+ stride = alltest_nsamples // test1024_nsamples
+ remove = alltest_nsamples % test1024_nsamples
+ for cam in ["left", "right"]:
+ camstr = "" if cam == "left" else f"_{cam}cam"
+ for pass_ in ["final", "clean"]:
+ tosave[f"test1024_{pass_}pass{camstr}"] = sorted(
+ tosave[f"test_{pass_}pass{camstr}"]
+ )[:-remove][
+ ::stride
+ ] # warning, it was not sorted before
+ assert (
+ len(tosave["test1024_cleanpass"]) == 1024
+ ), "incorrect parsing of pairs in Flying Things"
+ tosave[f"test1024_allpass{camstr}"] = (
+ tosave[f"test1024_cleanpass{camstr}"]
+ + tosave[f"test1024_finalpass{camstr}"]
+ )
+ return tosave
+
+
+class MPISintelDataset(FlowDataset):
+ def _prepare_data(self):
+ self.name = "MPISintel"
+ self._set_root()
+ assert self.split in [
+ s + "_" + p
+ for s in ["train", "test", "subval", "subtrain"]
+ for p in ["cleanpass", "finalpass", "allpass"]
+ ]
+ self.pairname_to_img1name = lambda pairname: osp.join(
+ self.root, pairname[0], "frame_{:04d}.png".format(pairname[1])
+ )
+ self.pairname_to_img2name = lambda pairname: osp.join(
+ self.root, pairname[0], "frame_{:04d}.png".format(pairname[1] + 1)
+ )
+ self.pairname_to_flowname = (
+ lambda pairname: None
+ if pairname[0].startswith("test/")
+ else osp.join(
+ self.root,
+ pairname[0].replace("/clean/", "/flow/").replace("/final/", "/flow/"),
+ "frame_{:04d}.flo".format(pairname[1]),
+ )
+ )
+ self.pairname_to_str = lambda pairname: osp.join(
+ pairname[0], "frame_{:04d}".format(pairname[1])
+ )
+ self.load_flow = _read_flo_file
+
+ def _build_cache(self):
+ trainseqs = sorted(os.listdir(self.root + "training/clean"))
+ trainpairs = [
+ (osp.join("training/clean", s), i)
+ for s in trainseqs
+ for i in range(1, len(os.listdir(self.root + "training/clean/" + s)))
+ ]
+ subvalseqs = ["temple_2", "temple_3"]
+ subtrainseqs = [s for s in trainseqs if s not in subvalseqs]
+ subvalpairs = [(p, i) for p, i in trainpairs if any(s in p for s in subvalseqs)]
+ subtrainpairs = [
+ (p, i) for p, i in trainpairs if any(s in p for s in subtrainseqs)
+ ]
+ testseqs = sorted(os.listdir(self.root + "test/clean"))
+ testpairs = [
+ (osp.join("test/clean", s), i)
+ for s in testseqs
+ for i in range(1, len(os.listdir(self.root + "test/clean/" + s)))
+ ]
+ assert (
+ len(trainpairs) == 1041
+ and len(testpairs) == 552
+ and len(subvalpairs) == 98
+ and len(subtrainpairs) == 943
+ ), "incorrect parsing of pairs in MPI-Sintel"
+ tosave = {}
+ tosave["train_cleanpass"] = trainpairs
+ tosave["test_cleanpass"] = testpairs
+ tosave["subval_cleanpass"] = subvalpairs
+ tosave["subtrain_cleanpass"] = subtrainpairs
+ for t in ["train", "test", "subval", "subtrain"]:
+ tosave[t + "_finalpass"] = [
+ (p.replace("/clean/", "/final/"), i)
+ for p, i in tosave[t + "_cleanpass"]
+ ]
+ tosave[t + "_allpass"] = tosave[t + "_cleanpass"] + tosave[t + "_finalpass"]
+ return tosave
+
+ def submission_save_pairname(self, pairname, prediction, outdir, _time):
+ assert prediction.shape[2] == 2
+ outfile = os.path.join(
+ outdir, "submission", self.pairname_to_str(pairname) + ".flo"
+ )
+ os.makedirs(os.path.dirname(outfile), exist_ok=True)
+ writeFlowFile(prediction, outfile)
+
+ def finalize_submission(self, outdir):
+ assert self.split == "test_allpass"
+ bundle_exe = "/nfs/data/ffs-3d/datasets/StereoFlow/MPI-Sintel/bundler/linux-x64/bundler" # eg