SkyReelsV2Transformer3DModel
A Diffusion Transformer model for 3D video-like data, introduced in SkyReels-V2 by Skywork AI.
The model can be loaded with the following code snippet.
import torch

from diffusers import SkyReelsV2Transformer3DModel

transformer = SkyReelsV2Transformer3DModel.from_pretrained(
    "Skywork/SkyReels-V2-DF-1.3B-540P-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16
)
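The loaded transformer is typically handed to a SkyReels-V2 pipeline. A minimal sketch, assuming SkyReelsV2DiffusionForcingPipeline is the matching pipeline for this diffusion-forcing (DF) checkpoint:

import torch

from diffusers import SkyReelsV2DiffusionForcingPipeline, SkyReelsV2Transformer3DModel

transformer = SkyReelsV2Transformer3DModel.from_pretrained(
    "Skywork/SkyReels-V2-DF-1.3B-540P-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16
)
# Reuse the transformer inside the pipeline; the remaining components
# (VAE, text encoder, scheduler) are loaded from the same repository.
pipe = SkyReelsV2DiffusionForcingPipeline.from_pretrained(
    "Skywork/SkyReels-V2-DF-1.3B-540P-Diffusers", transformer=transformer, torch_dtype=torch.bfloat16
)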
SkyReelsV2Transformer3DModel
class diffusers.SkyReelsV2Transformer3DModel
< source >( patch_size: typing.Tuple[int] = (1, 2, 2) num_attention_heads: int = 16 attention_head_dim: int = 128 in_channels: int = 16 out_channels: int = 16 text_dim: int = 4096 freq_dim: int = 256 ffn_dim: int = 8192 num_layers: int = 32 cross_attn_norm: bool = True qk_norm: typing.Optional[str] = 'rms_norm_across_heads' eps: float = 1e-06 image_dim: typing.Optional[int] = None added_kv_proj_dim: typing.Optional[int] = None rope_max_seq_len: int = 1024 pos_embed_seq_len: typing.Optional[int] = None inject_sample_info: bool = False num_frame_per_block: int = 1 )
Parameters
- patch_size (Tuple[int], defaults to (1, 2, 2)) — 3D patch dimensions for video embedding (t_patch, h_patch, w_patch).
- num_attention_heads (int, defaults to 16) — The number of attention heads.
- attention_head_dim (int, defaults to 128) — The number of channels in each head.
- in_channels (int, defaults to 16) — The number of channels in the input.
- out_channels (int, defaults to 16) — The number of channels in the output.
- text_dim (int, defaults to 4096) — Input dimension for text embeddings.
- freq_dim (int, defaults to 256) — Dimension for sinusoidal time embeddings.
- ffn_dim (int, defaults to 8192) — Intermediate dimension in the feed-forward network.
- num_layers (int, defaults to 32) — The number of layers of transformer blocks to use.
- window_size (Tuple[int], defaults to (-1, -1)) — Window size for local attention (-1 indicates global attention).
- cross_attn_norm (bool, defaults to True) — Enable cross-attention normalization.
- qk_norm (str, optional, defaults to "rms_norm_across_heads") — Enable query/key normalization.
- eps (float, defaults to 1e-6) — Epsilon value for normalization layers.
- inject_sample_info (bool, defaults to False) — Whether to inject sample information (such as FPS) into the model.
- image_dim (int, optional) — The dimension of the image embeddings.
- added_kv_proj_dim (int, optional) — The dimension of the added key/value projection.
- rope_max_seq_len (int, defaults to 1024) — The maximum sequence length for the rotary embeddings.
- pos_embed_seq_len (int, optional) — The sequence length for the positional embeddings.
- num_frame_per_block (int, defaults to 1) — The number of frames per block when generating with diffusion forcing.
A Transformer model for video-like data used in the Wan-based SkyReels-V2 model.
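The configuration arguments above can also be passed directly to the constructor. A minimal sketch that builds a deliberately tiny model for smoke-testing; the values are illustrative and far smaller than the released checkpoints:

import torch

from diffusers import SkyReelsV2Transformer3DModel

# Tiny, illustrative configuration (not the pretrained 1.3B setup).
transformer = SkyReelsV2Transformer3DModel(
    patch_size=(1, 2, 2),
    num_attention_heads=2,
    attention_head_dim=8,
    in_channels=16,
    out_channels=16,
    text_dim=32,
    ffn_dim=64,
    num_layers=2,
)
print(sum(p.numel() for p in transformer.parameters()))  # total parameter count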
Transformer2DModelOutput
class diffusers.models.modeling_outputs.Transformer2DModelOutput
< source >( sample: torch.Tensor )
Parameters
- sample (torch.Tensor of shape (batch_size, num_channels, height, width), or (batch_size, num_vector_embeds - 1, num_latent_pixels) if Transformer2DModel is discrete) — The hidden states output conditioned on the encoder_hidden_states input. If discrete, returns probability distributions for the unnoised latent pixels.
The output of Transformer2DModel.
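For reference, a sketch of how a forward pass produces this output. The input shapes and call signature are assumptions based on the Wan-style interface (hidden_states, timestep, encoder_hidden_states), using the tiny illustrative configuration from above:

import torch

from diffusers import SkyReelsV2Transformer3DModel

transformer = SkyReelsV2Transformer3DModel(
    num_attention_heads=2, attention_head_dim=8, text_dim=32, ffn_dim=64, num_layers=2
)
# Latent video of shape (batch, in_channels, num_frames, height, width);
# height and width must be divisible by the spatial patch size (2, 2).
hidden_states = torch.randn(1, 16, 1, 8, 8)
timestep = torch.tensor([500])
encoder_hidden_states = torch.randn(1, 4, 32)  # (batch, seq_len, text_dim)

output = transformer(
    hidden_states=hidden_states,
    timestep=timestep,
    encoder_hidden_states=encoder_hidden_states,
)
print(output.sample.shape)  # matches the input latent shape: (1, 16, 1, 8, 8)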