Transformers documentation


You are viewing v4.48.0 version. A newer version v4.50.0 is available.



The VitPose model was proposed in ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation by Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao. VitPose employs a standard, non-hierarchical Vision Transformer as backbone for the task of keypoint estimation. A simple decoder head is added on top to predict the heatmaps from a given image. Despite its simplicity, the model gets state-of-the-art results on the challenging MS COCO Keypoint Detection benchmark.

The abstract from the paper is the following:

Although no specific domain knowledge is considered in the design, plain vision transformers have shown excellent performance in visual recognition tasks. However, little effort has been made to reveal the potential of such simple structures for pose estimation tasks. In this paper, we show the surprisingly good capabilities of plain vision transformers for pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model called ViTPose. Specifically, ViTPose employs plain and non-hierarchical vision transformers as backbones to extract features for a given person instance and a lightweight decoder for pose estimation. It can be scaled up from 100M to 1B parameters by taking the advantages of the scalable model capacity and high parallelism of transformers, setting a new Pareto front between throughput and performance. Besides, ViTPose is very flexible regarding the attention type, input resolution, pre-training and finetuning strategy, as well as dealing with multiple pose tasks. We also empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token. Experimental results show that our basic ViTPose model outperforms representative methods on the challenging MS COCO Keypoint Detection benchmark, while the largest model sets a new state-of-the-art.


This model was contributed by nielsr and sangbumchoi. The original code can be found here.

Usage Tips

ViTPose is a so-called top-down keypoint detection model. This means that one first uses an object detector, like RT-DETR, to detect people (or other instances) in an image. Next, ViTPose takes the cropped images as input and predicts the keypoints.

import torch
import requests
import numpy as np

from PIL import Image

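from transformers import AutoProcessor, RTDetrForObjectDetection, VitPoseForPoseEstimation

device = "cuda" if torch.cuda.is_available() else "cpu"

# The image URL is missing from this page; set `url` to an image that contains people
url = ""
image = Image.open(requests.get(url, stream=True).raw)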

# ------------------------------------------------------------------------
# Stage 1. Detect humans on the image
# ------------------------------------------------------------------------

# You can use any object detector of your choice
person_image_processor = AutoProcessor.from_pretrained("PekingU/rtdetr_r50vd_coco_o365")
person_model = RTDetrForObjectDetection.from_pretrained("PekingU/rtdetr_r50vd_coco_o365", device_map=device)

inputs = person_image_processor(images=image, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = person_model(**inputs)

results = person_image_processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([(image.height, image.width)]), threshold=0.3
)
result = results[0]  # take first image results

# The person class has label index 0 in the COCO dataset
person_boxes = result["boxes"][result["labels"] == 0]
person_boxes = person_boxes.cpu().numpy()

# Convert boxes from VOC (x1, y1, x2, y2) to COCO (x1, y1, w, h) format
person_boxes[:, 2] = person_boxes[:, 2] - person_boxes[:, 0]
person_boxes[:, 3] = person_boxes[:, 3] - person_boxes[:, 1]

# ------------------------------------------------------------------------
# Stage 2. Detect keypoints for each person found
# ------------------------------------------------------------------------

image_processor = AutoProcessor.from_pretrained("usyd-community/vitpose-base-simple")
model = VitPoseForPoseEstimation.from_pretrained("usyd-community/vitpose-base-simple", device_map=device)

inputs = image_processor(image, boxes=[person_boxes], return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

pose_results = image_processor.post_process_pose_estimation(outputs, boxes=[person_boxes])
image_pose_result = pose_results[0]  # results for first image
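
The box format conversion between the two stages is easy to get wrong, so here is a minimal standalone sketch of it. The helper name `xyxy_to_xywh` and the sample boxes are illustrative assumptions, not part of the library; the example above does the same thing in place on a NumPy array.

```python
def xyxy_to_xywh(boxes):
    """Convert [x1, y1, x2, y2] (VOC) boxes to [x, y, width, height] (COCO)."""
    converted = []
    for x1, y1, x2, y2 in boxes:
        # Width and height are the differences between opposite corners
        converted.append([x1, y1, x2 - x1, y2 - y1])
    return converted


# Hypothetical detector output for two people
detector_boxes = [[10.0, 20.0, 110.0, 220.0], [0.0, 0.0, 50.0, 80.0]]
coco_boxes = xyxy_to_xywh(detector_boxes)
print(coco_boxes)
```

The pose image processor expects the COCO layout, so passing unconverted corner boxes would produce crops that are too large.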

Visualization for supervision users

import supervision as sv

xy = torch.stack([pose_result['keypoints'] for pose_result in image_pose_result]).cpu().numpy()
scores = torch.stack([pose_result['scores'] for pose_result in image_pose_result]).cpu().numpy()

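key_points = sv.KeyPoints(
    xy=xy, confidence=scores
)

# The annotator arguments are missing from this page; library defaults are used here
edge_annotator = sv.EdgeAnnotator()
vertex_annotator = sv.VertexAnnotator()
annotated_frame = edge_annotator.annotate(scene=image.copy(), key_points=key_points)
annotated_frame = vertex_annotator.annotate(scene=annotated_frame, key_points=key_points)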

Visualization for advanced users

import math
import cv2

def draw_points(image, keypoints, scores, pose_keypoint_color, keypoint_score_threshold, radius, show_keypoint_weight):
    if pose_keypoint_color is not None:
        assert len(pose_keypoint_color) == len(keypoints)
    for kid, (kpt, kpt_score) in enumerate(zip(keypoints, scores)):
        x_coord, y_coord = int(kpt[0]), int(kpt[1])
        # Skip keypoints whose confidence is below the threshold
        if kpt_score > keypoint_score_threshold:
            color = tuple(int(c) for c in pose_keypoint_color[kid])
            if show_keypoint_weight:
                cv2.circle(image, (int(x_coord), int(y_coord)), radius, color, -1)
                transparency = max(0, min(1, kpt_score))
                cv2.addWeighted(image, transparency, image, 1 - transparency, 0, dst=image)
            else:
                cv2.circle(image, (int(x_coord), int(y_coord)), radius, color, -1)

def draw_links(image, keypoints, scores, keypoint_edges, link_colors, keypoint_score_threshold, thickness, show_keypoint_weight, stick_width=2):
    height, width, _ = image.shape
    if keypoint_edges is not None and link_colors is not None:
        assert len(link_colors) == len(keypoint_edges)
        for sk_id, sk in enumerate(keypoint_edges):
            x1, y1, score1 = (int(keypoints[sk[0], 0]), int(keypoints[sk[0], 1]), scores[sk[0]])
            x2, y2, score2 = (int(keypoints[sk[1], 0]), int(keypoints[sk[1], 1]), scores[sk[1]])
            # Draw a link only if both endpoints are inside the image and confident enough
            if (
                x1 > 0
                and x1 < width
                and y1 > 0
                and y1 < height
                and x2 > 0
                and x2 < width
                and y2 > 0
                and y2 < height
                and score1 > keypoint_score_threshold
                and score2 > keypoint_score_threshold
            ):
                color = tuple(int(c) for c in link_colors[sk_id])
                if show_keypoint_weight:
                    X = (x1, x2)
                    Y = (y1, y2)
                    mean_x = np.mean(X)
                    mean_y = np.mean(Y)
                    length = ((Y[0] - Y[1]) ** 2 + (X[0] - X[1]) ** 2) ** 0.5
                    angle = math.degrees(math.atan2(Y[0] - Y[1], X[0] - X[1]))
                    polygon = cv2.ellipse2Poly(
                        (int(mean_x), int(mean_y)), (int(length / 2), int(stick_width)), int(angle), 0, 360, 1
                    )
                    cv2.fillConvexPoly(image, polygon, color)
                    # keypoints holds only (x, y), so the confidence comes from scores
                    transparency = max(0, min(1, 0.5 * (score1 + score2)))
                    cv2.addWeighted(image, transparency, image, 1 - transparency, 0, dst=image)
                else:
                    cv2.line(image, (x1, y1), (x2, y2), color, thickness=thickness)

# Note: keypoint_edges and color palette are dataset-specific
keypoint_edges = model.config.edges

palette = np.array(
    [
        [255, 128, 0],
        [255, 153, 51],
        [255, 178, 102],
        [230, 230, 0],
        [255, 153, 255],
        [153, 204, 255],
        [255, 102, 255],
        [255, 51, 255],
        [102, 178, 255],
        [51, 153, 255],
        [255, 153, 153],
        [255, 102, 102],
        [255, 51, 51],
        [153, 255, 153],
        [102, 255, 102],
        [51, 255, 51],
        [0, 255, 0],
        [0, 0, 255],
        [255, 0, 0],
        [255, 255, 255],
    ]
)

link_colors = palette[[0, 0, 0, 0, 7, 7, 7, 9, 9, 9, 9, 9, 16, 16, 16, 16, 16, 16, 16]]
keypoint_colors = palette[[16, 16, 16, 16, 16, 9, 9, 9, 9, 9, 9, 0, 0, 0, 0, 0, 0]]

numpy_image = np.array(image)

for pose_result in image_pose_result:
    scores = np.array(pose_result["scores"])
    keypoints = np.array(pose_result["keypoints"])

    # draw each point on image
    draw_points(numpy_image, keypoints, scores, keypoint_colors, keypoint_score_threshold=0.3, radius=4, show_keypoint_weight=False)

    # draw links
    draw_links(numpy_image, keypoints, scores, keypoint_edges, link_colors, keypoint_score_threshold=0.3, thickness=1, show_keypoint_weight=False)

pose_image = Image.fromarray(numpy_image)

MoE backbone

To enable the Mixture of Experts (MoE) feature in the backbone, set an appropriate configuration value such as num_experts and pass the dataset_index input to the backbone model. MoE is not enabled with the default parameters. The snippet below shows how to use it.

>>> from transformers import VitPoseBackboneConfig, VitPoseBackbone
>>> import torch

>>> config = VitPoseBackboneConfig(num_experts=3, out_indices=[-1])
>>> model = VitPoseBackbone(config)

>>> pixel_values = torch.randn(3, 3, 256, 192)
>>> dataset_index = torch.tensor([1, 2, 3])
>>> outputs = model(pixel_values, dataset_index)


class transformers.VitPoseImageProcessor


( do_affine_transform: bool = True size: typing.Dict[str, int] = None do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None **kwargs )


  • do_affine_transform (bool, optional, defaults to True) — Whether to apply an affine transformation to the input images.
  • size (Dict[str, int], optional, defaults to {"height": 256, "width": 192}) — Resolution of the image after affine_transform is applied. Only has an effect if do_affine_transform is set to True. Can be overridden by size in the preprocess method.
  • do_rescale (bool, optional, defaults to True) — Whether or not to apply the scaling factor (to make pixel values floats between 0. and 1.).
  • rescale_factor (int or float, optional, defaults to 1/255) — Scale factor to use if rescaling the image. Can be overridden by rescale_factor in the preprocess method.
  • do_normalize (bool, optional, defaults to True) — Whether or not to normalize the input with mean and standard deviation.
  • image_mean (float or List[float], optional, defaults to [0.485, 0.456, 0.406]) — The sequence of means for each channel, to be used when normalizing images.
  • image_std (float or List[float], optional, defaults to [0.229, 0.224, 0.225]) — The sequence of standard deviations for each channel, to be used when normalizing images.
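
As a rough illustration of how do_rescale and do_normalize combine, the sketch below applies the documented defaults to a single RGB pixel. It is plain Python written for this page, not the processor's actual code path, and the function name `rescale_and_normalize` is an assumption.

```python
# Defaults documented above
IMAGE_MEAN = [0.485, 0.456, 0.406]
IMAGE_STD = [0.229, 0.224, 0.225]
RESCALE_FACTOR = 1 / 255


def rescale_and_normalize(pixel):
    """Rescale an (R, G, B) pixel from 0..255 to 0..1, then standardize each channel."""
    return [
        (value * RESCALE_FACTOR - mean) / std
        for value, mean, std in zip(pixel, IMAGE_MEAN, IMAGE_STD)
    ]


white = rescale_and_normalize((255, 255, 255))
black = rescale_and_normalize((0, 0, 0))
print(white)
print(black)
```

Images that are already in the 0 to 1 range should skip the first step (do_rescale=False), otherwise they would be scaled down twice.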

Constructs a VitPose image processor.


preprocess

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]] boxes: typing.Union[typing.List[typing.List[float]], numpy.ndarray] do_affine_transform: bool = None size: typing.Dict[str, int] = None do_rescale: bool = None rescale_factor: float = None do_normalize: bool = None image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: typing.Union[str, transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None ) -> BatchFeature


  • images (ImageInput) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
  • boxes (List[List[List[float]]] or np.ndarray) — List or array of bounding boxes for each image. Each box should be a list of 4 floats representing the bounding box coordinates in COCO format (top_left_x, top_left_y, width, height).
  • do_affine_transform (bool, optional, defaults to self.do_affine_transform) — Whether to apply an affine transformation to the input images.
  • size (Dict[str, int] optional, defaults to self.size) — Dictionary in the format {"height": h, "width": w} specifying the size of the output image after resizing.
  • do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image values to the range [0, 1].
  • rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to rescale the image by if do_rescale is set to True.
  • do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.
  • image_mean (float or List[float], optional, defaults to self.image_mean) — Image mean to use if do_normalize is set to True.
  • image_std (float or List[float], optional, defaults to self.image_std) — Image standard deviation to use if do_normalize is set to True.
  • return_tensors (str or TensorType, optional, defaults to 'np') — If set, will return tensors of a particular framework. Acceptable values are:

    • 'tf': Return TensorFlow tf.constant objects.
    • 'pt': Return PyTorch torch.Tensor objects.
    • 'np': Return NumPy np.ndarray objects.
    • 'jax': Return JAX jnp.ndarray objects.



A BatchFeature with the following fields:

  • pixel_values — Pixel values to be fed to a model, of shape (batch_size, num_channels, height, width).

Preprocess an image or batch of images.


class transformers.VitPoseConfig


( backbone_config: PretrainedConfig = None backbone: str = None use_pretrained_backbone: bool = False use_timm_backbone: bool = False backbone_kwargs: dict = None initializer_range: float = 0.02 scale_factor: int = 4 use_simple_decoder: bool = True **kwargs )


  • backbone_config (PretrainedConfig or dict, optional, defaults to VitPoseBackboneConfig()) — The configuration of the backbone model. Currently, only backbone_config with vitpose_backbone as model_type is supported.
  • backbone (str, optional) — Name of backbone to use when backbone_config is None. If use_pretrained_backbone is True, this will load the corresponding pretrained weights from the timm or transformers library. If use_pretrained_backbone is False, this loads the backbone’s config and uses that to initialize the backbone with random weights.
  • use_pretrained_backbone (bool, optional, defaults to False) — Whether to use pretrained weights for the backbone.
  • use_timm_backbone (bool, optional, defaults to False) — Whether to load backbone from the timm library. If False, the backbone is loaded from the transformers library.
  • backbone_kwargs (dict, optional) — Keyword arguments to be passed to AutoBackbone when loading from a checkpoint e.g. {'out_indices': (0, 1, 2, 3)}. Cannot be specified if backbone_config is set.
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
  • scale_factor (int, optional, defaults to 4) — Factor to upscale the feature maps coming from the ViT backbone.
  • use_simple_decoder (bool, optional, defaults to True) — Whether to use a VitPoseSimpleDecoder to decode the feature maps from the backbone into heatmaps. Otherwise it uses VitPoseClassicDecoder.

This is the configuration class to store the configuration of a VitPoseForPoseEstimation. It is used to instantiate a VitPose model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the VitPose usyd-community/vitpose-base-simple architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.


>>> from transformers import VitPoseConfig, VitPoseForPoseEstimation

>>> # Initializing a VitPose configuration
>>> configuration = VitPoseConfig()

>>> # Initializing a model (with random weights) from the configuration
>>> model = VitPoseForPoseEstimation(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config


class transformers.VitPoseForPoseEstimation


( config: VitPoseConfig )


  • config (VitPoseConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The VitPose model with a pose estimation head on top. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.


forward

( pixel_values: Tensor dataset_index: typing.Optional[torch.Tensor] = None flip_pairs: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) -> transformers.models.vitpose.modeling_vitpose.VitPoseEstimatorOutput or tuple(torch.FloatTensor)


  • pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using VitPoseImageProcessor. See its preprocess method for details.
  • dataset_index (torch.Tensor of shape (batch_size,)) — Index to use in the Mixture-of-Experts (MoE) blocks of the backbone.

    This corresponds to the dataset index used during training. For a model trained on a single dataset, index 0 refers to that dataset. For a model trained on multiple datasets, index 0 refers to dataset A (e.g. MPII) and index 1 refers to dataset B (e.g. CrowdPose).

  • flip_pairs (torch.tensor, optional) — Whether to mirror pairs of keypoints (for example, left ear — right ear).
  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
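
To make the idea behind flip_pairs concrete, the sketch below shows why left/right keypoints have to be swapped when a prediction made on a horizontally flipped image is mapped back to the original. It works on plain coordinates and is only an illustration under that assumption; the helper `unflip_keypoints`, the three-keypoint layout, and the sample values are hypothetical and do not reflect how the model handles flip_pairs internally.

```python
def unflip_keypoints(keypoints, flip_pairs, image_width):
    """Map keypoints predicted on a horizontally flipped image back to the original image."""
    # Mirror the x coordinate; y is unchanged by a horizontal flip
    restored = [[image_width - 1 - x, y] for x, y in keypoints]
    # A left keypoint in the flipped image is a right keypoint in the original, and vice versa
    for left, right in flip_pairs:
        restored[left], restored[right] = restored[right], restored[left]
    return restored


# Hypothetical layout: index 0 = nose, 1 = left eye, 2 = right eye
flip_pairs = [(1, 2)]
flipped_prediction = [[59, 10], [64, 8], [54, 8]]
restored = unflip_keypoints(flipped_prediction, flip_pairs, image_width=100)
print(restored)
```

Without the swap, the coordinates would be mirrored correctly but the left and right labels would be exchanged.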


Returns: transformers.models.vitpose.modeling_vitpose.VitPoseEstimatorOutput or tuple(torch.FloatTensor)

A transformers.models.vitpose.modeling_vitpose.VitPoseEstimatorOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VitPoseConfig) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Loss is not supported at this moment. See for further detail.

  • heatmaps (torch.FloatTensor of shape (batch_size, num_keypoints, height, width)) — Heatmaps as predicted by the model.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the model at the output of each stage.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, patch_size, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The VitPoseForPoseEstimation forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.


>>> from transformers import AutoImageProcessor, VitPoseForPoseEstimation
>>> import torch
>>> from PIL import Image
>>> import requests

>>> processor = AutoImageProcessor.from_pretrained("usyd-community/vitpose-base-simple")
>>> model = VitPoseForPoseEstimation.from_pretrained("usyd-community/vitpose-base-simple")

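>>> # The image URL is missing from this page; set `url` to an image that contains people
>>> url = ""
>>> image = Image.open(requests.get(url, stream=True).raw)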
>>> boxes = [[[412.8, 157.61, 53.05, 138.01], [384.43, 172.21, 15.12, 35.74]]]
>>> inputs = processor(image, boxes=boxes, return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)
>>> heatmaps = outputs.heatmaps
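
Each heatmap channel scores how likely a keypoint is at every location. A simplified way to read one channel is to take its highest-scoring cell, as sketched below in plain Python. This is only the core idea under simplifying assumptions: the helper `decode_heatmap` and the toy values are made up for illustration, and in practice post_process_pose_estimation should be used because it also maps the result back to the original image coordinates.

```python
def decode_heatmap(heatmap):
    """Return ((x, y), score) for the highest-scoring cell of a 2D heatmap."""
    best_score = float("-inf")
    best_xy = (0, 0)
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_score:
                best_score = value
                best_xy = (x, y)
    return best_xy, best_score


# Toy 3x4 heatmap for a single keypoint (rows are y, columns are x)
toy_heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.2, 0.9, 0.1],
    [0.0, 0.1, 0.3, 0.0],
]
location, score = decode_heatmap(toy_heatmap)
print(location, score)
```

The location is expressed in heatmap cells, which is a lower resolution than the input crop, so it still needs to be scaled and shifted before drawing it on the image.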