Transformers documentation
V-JEPA 2
V-JEPA 2 is a self-supervised approach to training video encoders developed by FAIR, Meta. Using internet-scale video data, V-JEPA 2 attains state-of-the-art performance on motion understanding and human action anticipation tasks. V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.

You can find all original V-JEPA 2 checkpoints under the V-JEPA 2 collection.
This model was contributed by koustuvs, yonigozlan and qubvel. The original code can be found here.
Usage example
The snippet below shows how to load the V-JEPA 2 model for feature extraction using the AutoModel class.
import torch
import numpy as np
from torchcodec.decoders import VideoDecoder
from transformers import AutoVideoProcessor, AutoModel
processor = AutoVideoProcessor.from_pretrained("facebook/vjepa2-vitl-fpc64-256")
model = AutoModel.from_pretrained(
"facebook/vjepa2-vitl-fpc64-256",
torch_dtype=torch.float16,
device_map="auto",
attn_implementation="sdpa"
)
video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4"
vr = VideoDecoder(video_url)
frame_idx = np.arange(0, 64)  # choose some frames; you can define a more complex sampling strategy here
video = vr.get_frames_at(indices=frame_idx).data # T x C x H x W
video = processor(video, return_tensors="pt").to(model.device)
outputs = model(**video)
# V-JEPA 2 encoder outputs, same as calling `model.get_vision_features()`
encoder_outputs = outputs.last_hidden_state
# V-JEPA 2 predictor outputs
predictor_outputs = outputs.predictor_output.last_hidden_state
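The encoder output is a sequence of patch embeddings. If you need a single feature vector per video (for retrieval or linear probing, for example), one simple option, continuing from the snippet above, is to mean-pool over the token dimension; this pooling choice is an illustration, not part of the original recipe.
# Average the patch tokens into one embedding per video
video_embedding = encoder_outputs.mean(dim=1)  # shape: (batch_size, hidden_size)
print(video_embedding.shape)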
V-JEPA 2 can also be fine-tuned for video classification. The following snippet shows how to use a model fine-tuned on Something-Something-V2 for video classification.
import torch
import numpy as np
from torchcodec.decoders import VideoDecoder
from transformers import AutoVideoProcessor, AutoModelForVideoClassification
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load model and video preprocessor
hf_repo = "facebook/vjepa2-vitl-fpc16-256-ssv2"
model = AutoModelForVideoClassification.from_pretrained(hf_repo).to(device)
processor = AutoVideoProcessor.from_pretrained(hf_repo)
# To load a video, sample the number of frames according to the model.
video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/bowling/-WH-lxmGJVY_000005_000015.mp4"
vr = VideoDecoder(video_url)
frame_idx = np.arange(0, model.config.frames_per_clip * 8, 8) # sample frames_per_clip frames with a stride of 8; you can define a more complex sampling strategy
video = vr.get_frames_at(indices=frame_idx).data # frames x channels x height x width
# Preprocess and run inference
inputs = processor(video, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)
logits = outputs.logits
print("Top 5 predicted class names:")
top5_indices = logits.topk(5).indices[0]
top5_probs = torch.softmax(logits, dim=-1).topk(5).values[0]
for idx, prob in zip(top5_indices, top5_probs):
text_label = model.config.id2label[idx.item()]
print(f" - {text_label}: {prob:.2f}")
VJEPA2Config
class transformers.VJEPA2Config
< source >( patch_size = 16 crop_size = 256 frames_per_clip = 64 tubelet_size = 2 hidden_size = 1024 in_chans = 3 num_attention_heads = 16 num_hidden_layers = 24 drop_path_rate = 0.0 mlp_ratio = 4.0 layer_norm_eps = 1e-06 qkv_bias = True attention_probs_dropout_prob = 0.0 hidden_act = 'gelu' initializer_range = 0.02 attention_dropout = 0.0 num_pooler_layers = 3 pred_hidden_size = 384 pred_num_attention_heads = 12 pred_num_hidden_layers = 12 pred_num_mask_tokens = 10 pred_zero_init_mask_tokens = True pred_mlp_ratio = 4.0 **kwargs )
Parameters
- patch_size (int, optional, defaults to 16) — The size (resolution) of each patch.
- crop_size (int, optional, defaults to 256) — Input resolution of the model.
- frames_per_clip (int, optional, defaults to 64) — The number of frames the model has been pretrained with. Does not impact inference.
- tubelet_size (int, optional, defaults to 2) — The number of temporal frames used for a single tubelet; check the paper for more information.
- hidden_size (int, optional, defaults to 1024) — Dimensionality of the encoder layers.
- in_chans (int, optional, defaults to 3) — The number of input channels.
- num_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the encoder.
- num_hidden_layers (int, optional, defaults to 24) — The number of hidden layers.
- drop_path_rate (float, optional, defaults to 0.0) — Stochastic depth rate per sample (when applied in the main path of residual layers).
- mlp_ratio (float, optional, defaults to 4.0) — Ratio of the hidden size of the MLPs used in the encoder relative to hidden_size.
- layer_norm_eps (float, optional, defaults to 1e-06) — The epsilon used by the layer normalization layers.
- qkv_bias (bool, optional, defaults to True) — Whether to add a bias to the queries, keys and values.
- attention_probs_dropout_prob (float, optional, defaults to 0.0) — The dropout probability for the attention probabilities.
- hidden_act (str, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- attention_dropout (float, optional, defaults to 0.0) — The dropout probability for attentions.
- num_pooler_layers (int, optional, defaults to 3) — The number of self-attention layers in the pooler.
- pred_hidden_size (int, optional, defaults to 384) — Dimensionality of the predictor layers.
- pred_num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the predictor.
- pred_num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the predictor.
- pred_num_mask_tokens (int, optional, defaults to 10) — The number of mask tokens to use in the predictor.
- pred_zero_init_mask_tokens (bool, optional, defaults to True) — Whether to initialize the mask tokens in the predictor with zeros.
- pred_mlp_ratio (float, optional, defaults to 4.0) — Ratio of the hidden size of the MLPs used in the predictor relative to pred_hidden_size.
This is the configuration class to store the configuration of a VJEPA2Model. It is used to instantiate a V-JEPA 2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the V-JEPA 2 facebook/vjepa2-vitl-fpc64-256 architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import VJEPA2Config, VJEPA2Model
>>> # Initializing a VJEPA2 vjepa2-vitl-fpc64-256 style configuration
>>> configuration = VJEPA2Config()
>>> # Initializing a model (with random weights) from the vjepa2-vitl-fpc64-256 style configuration
>>> model = VJEPA2Model(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
VJEPA2Model
class transformers.VJEPA2Model
< source >( config: VJEPA2Config )
Parameters
- config (VJEPA2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare V-JEPA 2 Model outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( pixel_values_videos: Tensor context_head_mask: typing.Optional[torch.Tensor] = None context_mask: typing.Optional[list[torch.Tensor]] = None target_head_mask: typing.Optional[torch.Tensor] = None target_mask: typing.Optional[list[torch.Tensor]] = None skip_predictor: bool = False output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None ) → transformers.models.vjepa2.modeling_vjepa2.VJEPA2WithMaskedInputModelOutput or tuple(torch.FloatTensor)
Parameters
- pixel_values_videos (torch.Tensor with shape [batch_size x num_frames x num_channels x height x width], required) — The input video pixels, processed by VJEPA2VideoProcessor.
- context_head_mask (torch.Tensor with shape [num_heads] or [num_hidden_layers x num_heads], optional) — The mask indicating whether to keep each head (1.0 for keep, 0.0 for discard) for the context.
- context_mask (torch.Tensor with shape [batch_size, patch_size, 1], optional) — The mask position ids indicating which encoder output patches are exposed to the predictor. By default, this mask is created as torch.arange(N).unsqueeze(0).repeat(B, 1), indicating full context is available to the predictor.
- target_head_mask (torch.Tensor with shape [num_heads] or [num_hidden_layers x num_heads], optional) — The mask indicating whether to keep each head (1.0 for keep, 0.0 for discard) for the target.
- target_mask (torch.Tensor with shape [batch_size, patch_size, 1], optional) — The mask position ids indicating which encoder output patches are used as prediction targets for the predictor. By default, this mask is created as torch.arange(N).unsqueeze(0).repeat(B, 1), indicating that the predictor should predict all encoder patches.
- skip_predictor (bool, defaults to False) — Flag to skip the predictor forward pass; useful if you only need the encoder outputs.
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
Returns
transformers.models.vjepa2.modeling_vjepa2.VJEPA2WithMaskedInputModelOutput or tuple(torch.FloatTensor)
A transformers.models.vjepa2.modeling_vjepa2.VJEPA2WithMaskedInputModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VJEPA2Config) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- masked_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional, returned when context_mask is provided, which is applied on the VJEPA2Encoder outputs) — The masked hidden state of the model.
- hidden_states (tuple[torch.FloatTensor, ...], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple[torch.FloatTensor, ...], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- predictor_output (VJEPA2WithMaskedInputPredictorOutput, optional) — The output from the Predictor module.
The VJEPA2Model forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
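For a quick sanity check of the forward() arguments described above, the sketch below calls the model with skip_predictor=True on a dummy clip so that only the encoder runs. The checkpoint name and tensor shapes follow the earlier examples; the dummy video and the choice to leave the mask arguments at their defaults are illustrative, not the canonical usage.
import torch
from transformers import AutoVideoProcessor, VJEPA2Model

processor = AutoVideoProcessor.from_pretrained("facebook/vjepa2-vitl-fpc64-256")
model = VJEPA2Model.from_pretrained("facebook/vjepa2-vitl-fpc64-256")

# Dummy clip: 64 frames of 256x256 RGB, in frames x channels x height x width layout
video = torch.zeros(64, 3, 256, 256)
inputs = processor(video, return_tensors="pt")

with torch.no_grad():
    # skip_predictor=True skips the predictor forward pass, so only the
    # encoder outputs (outputs.last_hidden_state) are computed
    outputs = model(**inputs, skip_predictor=True)

print(outputs.last_hidden_state.shape)  # (batch_size, num_patches, hidden_size)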
VJEPA2ForVideoClassification
class transformers.VJEPA2ForVideoClassification
< source >( config: VJEPA2Config )
Parameters
- config (VJEPA2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
V-JEPA 2 Model transformer with a video classification head on top (a linear layer on top of the attentive pooler).
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( pixel_values_videos: Tensor labels: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None ) → transformers.modeling_outputs.ImageClassifierOutput or tuple(torch.FloatTensor)
Parameters
- pixel_values_videos (torch.Tensor with shape [batch_size x num_frames x num_channels x height x width]) — The input video pixels, processed by VJEPA2VideoProcessor.
- labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss), if config.num_labels > 1 a classification loss is computed (Cross-Entropy).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
Returns
transformers.modeling_outputs.ImageClassifierOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.ImageClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VJEPA2Config) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
- logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the model at the output of each stage.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, patch_size, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The VJEPA2ForVideoClassification forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Examples:
>>> import torch
>>> import numpy as np
>>> from transformers import AutoVideoProcessor, VJEPA2ForVideoClassification
>>> device = "cuda"
>>> video_processor = AutoVideoProcessor.from_pretrained("facebook/vjepa2-vitl-fpc16-256-ssv2")
>>> model = VJEPA2ForVideoClassification.from_pretrained("facebook/vjepa2-vitl-fpc16-256-ssv2").to(device)
>>> video = np.ones((64, 256, 256, 3)) # 64 frames, 256x256 RGB
>>> inputs = video_processor(video, return_tensors="pt").to(device)
>>> # For inference
>>> with torch.no_grad():
... outputs = model(**inputs)
>>> logits = outputs.logits
>>> predicted_label = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_label])
>>> # For training
>>> labels = torch.ones(1, dtype=torch.long, device=device)
>>> loss = model(**inputs, labels=labels).loss
VJEPA2VideoProcessor
class transformers.VJEPA2VideoProcessor
< source >( **kwargs: typing_extensions.Unpack[transformers.models.vjepa2.video_processing_vjepa2.VJEPA2VideoProcessorInitKwargs] )
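For reference, a minimal preprocessing sketch in the same style as the examples above; the checkpoint name and dummy clip are placeholders, and the processor returns the pixel_values_videos tensor expected by the model classes documented here.
import numpy as np
from transformers import VJEPA2VideoProcessor

processor = VJEPA2VideoProcessor.from_pretrained("facebook/vjepa2-vitl-fpc64-256")

# Dummy 16-frame RGB clip in frames x height x width x channels layout;
# the processor resizes/crops it to the model's input resolution
video = np.zeros((16, 480, 640, 3), dtype=np.uint8)
inputs = processor(video, return_tensors="pt")
print(inputs["pixel_values_videos"].shape)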