<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Monocular depth estimation

Monocular depth estimation is a computer vision task that involves predicting the depth information of a scene from a
single image. In other words, it is the process of estimating the distance of objects in a scene from
a single camera viewpoint.

Monocular depth estimation has various applications, including 3D reconstruction, augmented reality, autonomous driving,
and robotics. It is a challenging task as it requires the model to understand the complex relationships between objects
in the scene and the corresponding depth information, which can be affected by factors such as lighting conditions,
occlusion, and texture.
<Tip>

The task illustrated in this tutorial is supported by the following model architectures:

<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->

[DPT](../model_doc/dpt), [GLPN](../model_doc/glpn)

<!--End of the generated tip-->

</Tip>
In this guide you'll learn how to:

* create a depth estimation pipeline
* run depth estimation inference by hand

Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install -q transformers
```
## Depth estimation pipeline

The simplest way to try out inference with a model supporting depth estimation is to use the corresponding [`pipeline`].
Instantiate a pipeline from a [checkpoint on the Hugging Face Hub](https://huggingface.co/models?pipeline_tag=depth-estimation&sort=downloads):

```py
>>> from transformers import pipeline

>>> checkpoint = "vinvino02/glpn-nyu"
>>> depth_estimator = pipeline("depth-estimation", model=checkpoint)
```
Next, choose an image to analyze:

```py
>>> from PIL import Image
>>> import requests

>>> url = "https://unsplash.com/photos/HwBAsSbPBDU/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MzR8fGNhciUyMGluJTIwdGhlJTIwc3RyZWV0fGVufDB8MHx8fDE2Nzg5MDEwODg&force=true&w=640"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> image
```

<div class="flex justify-center">
     <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/depth-estimation-example.jpg" alt="Photo of a busy street"/>
</div>
Pass the image to the pipeline.

```py
>>> predictions = depth_estimator(image)
```
The pipeline returns a dictionary with two entries. The first one, `predicted_depth`, is a tensor whose values are
the depth expressed in meters for each pixel. The second one, `depth`, is a PIL image that visualizes the depth
estimation result.
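If you'd like to inspect the raw output before visualizing it, a quick sanity check like the sketch below can help.
The comments describe what to expect in general terms; the exact tensor shape depends on how the checkpoint
preprocesses the image:

```py
>>> predictions.keys()  # the two entries described above: 'predicted_depth' and 'depth'
>>> predictions["predicted_depth"].shape  # a torch.Tensor of per-pixel depth values
>>> predictions["depth"]  # a PIL.Image ready for display
```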
Let's take a look at the visualized result:

```py
>>> predictions["depth"]
```

<div class="flex justify-center">
     <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/depth-visualization.png" alt="Depth estimation visualization"/>
</div>
## Depth estimation inference by hand

Now that you've seen how to use the depth estimation pipeline, let's see how we can replicate the same result by hand.

Start by loading the model and associated processor from a [checkpoint on the Hugging Face Hub](https://huggingface.co/models?pipeline_tag=depth-estimation&sort=downloads).
Here we'll use the same checkpoint as before:

```py
>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation

>>> checkpoint = "vinvino02/glpn-nyu"
>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint)
>>> model = AutoModelForDepthEstimation.from_pretrained(checkpoint)
```
Prepare the image input for the model using the `image_processor`, which takes care of the necessary image
transformations such as resizing and normalization:

```py
>>> pixel_values = image_processor(image, return_tensors="pt").pixel_values
```
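Optionally, you can peek at what the processor produced. This is just a sanity-check sketch; the exact spatial size
depends on the checkpoint's resizing configuration:

```py
>>> pixel_values.shape  # (batch, channels, height, width); height and width reflect the processor's resizing
>>> pixel_values.dtype  # the processor returns floating-point pixel values
```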
Pass the prepared inputs through the model:

```py
>>> import torch

>>> # disable gradient tracking since we only run inference
>>> with torch.no_grad():
...     outputs = model(pixel_values)
...     predicted_depth = outputs.predicted_depth
```
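Note that `predicted_depth` lives in the coordinate space of the processed input, not necessarily the original image,
which is why the next step interpolates it back to the original resolution. A quick, optional check (purely
illustrative):

```py
>>> predicted_depth.shape  # depth map predicted by the model; its spatial size may differ from the original image
>>> image.size  # (width, height) of the original image, for comparison
```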
Visualize the results:

```py
>>> import numpy as np

>>> # interpolate back to the original size; PIL reports (width, height) while
>>> # interpolate expects (height, width), hence the reversal with [::-1]
>>> prediction = torch.nn.functional.interpolate(
...     predicted_depth.unsqueeze(1),
...     size=image.size[::-1],
...     mode="bicubic",
...     align_corners=False,
... ).squeeze()
>>> output = prediction.numpy()

>>> # rescale the depth values to the 0-255 range so they can be rendered as a grayscale image
>>> formatted = (output * 255 / np.max(output)).astype("uint8")
>>> depth = Image.fromarray(formatted)
>>> depth
```
<div class="flex justify-center">
     <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/depth-visualization.png" alt="Depth estimation visualization"/>
</div>