StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation. Adyasha Maharana, Darryl Hannan and Mohit Bansal (UNC Chapel Hill). Published at ECCV 2022.
+
+ StoryDALL-E \[1\] is a model trained for the task of Story Visualization \[2\].
+ The model receives a sequence of captions as input and generates a corresponding sequence of images which form a visual story depicting the narrative in the captions.
+ We modify this task to enable the model to receive an initial scene as input, which can be used as a cue for the setting of the story and also for generating unseen or low-resource visual elements. We refer to this task as Story Continuation \[1\].
+ StoryDALL-E is based on the [mega-dalle](https://github.com/borisdayma/dalle-mini) model and is adapted from the corresponding [PyTorch codebase](https://github.com/kuprel/min-dalle).
+ **This model has been developed for academic purposes only.**
+
+ \[[Paper](http://arxiv.org/abs/2209.06192)\] \[[Code](https://github.com/adymaharana/storydalle)\] \[[Model Card](https://github.com/adymaharana/storydalle/blob/main/MODEL_CARD.MD)\]
+
+ ### Dataset
+ This model has been trained using the Pororo story visualization dataset \[1\].
+ The data was adapted from the popular cartoon series *Pororo the Little Penguin* and originally released by \[2\].
+ The Pororo dataset contains 9 recurring characters, as shown below, in decreasing order of their frequency in the training data.
+
+
+
+ The training set contains nearly 10,000 samples. Most of the scenes occur in a snowy village, surrounded by hills, trees and houses. A few episodes are located in gardens or water bodies. All the captions are in English and predominantly contain verbs in the present tense. Additionally, the training of this model starts from the pretrained checkpoint of mega-dalle, which is trained on the Conceptual Captions dataset \[4\].
+
+ ### Intended Use
+ This model is intended for generating visual stories containing the 9 characters in the Pororo dataset. This version of the StoryDALL-E model performs reasonably well in the following scenarios:
+ * Frames containing a single character.
+ * Overtly visual actions such as *making cookies*, *walking*, *reading a book*, *sitting*.
+ * Scenes taking place in snowy settings, indoors and gardens.
+ * Visual stories containing 1-3 characters across all frames.
+ * Scene transitions e.g. from day to night.
+ * Semantic concepts that do not appear in the story continuation dataset, such as *doughnut* and *lion* (generated with moderate success).
+
+ Here are some examples of generated visual stories for the above-mentioned settings.
+
+
+
+
+
+ Due to the small size of the story visualization training dataset, the model generalizes poorly to some unseen settings. It struggles to generate coherent images in the following scenarios:
+ * Multiple characters in a frame.
+ * Non-visual actions such as *compliment*.
+ * Characters that are infrequent in the training dataset e.g. Rody, Harry.
+ * Background locations that are not found in the cartoon e.g. a busy city.
+ * Color-based descriptions of objects.
+ * Completely new characters based on textual descriptions.
+
+ In the following demo, up to four captions can be entered in the `caption` text fields for the visual story.
+ Select a `source` frame based on the character that is predominant in your visual story.
+ `top_k` refers to the number of highest probability vocabulary tokens to keep for top-k-filtering.
+ Only the most probable tokens with probabilities that add up to `top_p` or higher are kept for generation.
+ Set `supercondition` to True to strengthen the effect of the captions by contrasting against an unconditional (null) prompt during generation.
+ Set `n_candidates` (between 1 and 4) to generate a diverse set of candidate stories for the given captions.
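+
+ For reference, the snippet below is a minimal sketch (not the exact sampling code used by this demo) of how `top_k` and nucleus (`top_p`) filtering are typically combined on the next-token logits before sampling:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def filter_logits(logits: torch.Tensor, top_k: int = 32, top_p: float = 0.2) -> torch.Tensor:
+     # Keep only the `top_k` highest-probability tokens.
+     if top_k > 0:
+         kth_value = torch.topk(logits, top_k).values[..., -1, None]
+         logits = logits.masked_fill(logits < kth_value, float("-inf"))
+     # Keep the smallest set of tokens whose cumulative probability reaches `top_p`.
+     if top_p is not None:
+         sorted_logits, sorted_idx = torch.sort(logits, descending=True)
+         cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
+         remove = cum_probs > top_p
+         remove[..., 1:] = remove[..., :-1].clone()   # shift right: always keep the top token
+         remove[..., 0] = False
+         logits = logits.masked_fill(remove.scatter(-1, sorted_idx, remove), float("-inf"))
+     return F.softmax(logits, dim=-1)   # distribution to sample the next image token from
+
+ # e.g. probs = filter_logits(torch.randn(1, 16384)); next_token = torch.multinomial(probs, 1)
+ ```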
+
+ Feel free to send feedback to adyasha@cs.unc.edu.
+ ''')
+
+ with gr.Row():
+ with gr.Column():
+ caption_1 = gr.Textbox(label="Caption 1", value='Pororo is reading a book.')
+ caption_2 = gr.Textbox(label="Caption 2", value='Pororo is sleeping on the couch.')
+ caption_3 = gr.Textbox(label="Caption 3", value='Pororo wakes up in the middle of the night in his bed.')
+ caption_4 = gr.Textbox(label="Caption 4", value='Pororo is in his bedroom and looks terrified.')
+ source = gr.Radio(["Pororo", "Loopy", "Crong", "Poby", "Eddy", "Petty", "Tongtong", "Rody", "Harry"],
+ label="Source", value="Pororo")
+ top_k = gr.Slider(16, 128, label="top_k", value=32)
+ top_p = gr.Slider(0.01, 1.0, label="top_p", value=0.2)
+ supercondition = gr.Checkbox(value=False, label='supercondition')
+ n_candidates = gr.Dropdown([1, 2, 3, 4], value=4, label='n_candidates')
+
+ with gr.Row():
+ # clear_btn = gr.Button("Clear")
+ submit_btn = gr.Button("Submit")
+
+ with gr.Column():
+ with gr.Row():
+ frame_1_label = gr.Button("Frame 1")
+ frame_2_label = gr.Button("Frame 2")
+ frame_3_label = gr.Button("Frame 3")
+ frame_4_label = gr.Button("Frame 4")
+ # frame_1_label = gr.Label("Frame 1")
+ # frame_2_label = gr.Label("Frame 2")
+ # frame_3_label = gr.Label("Frame 3")
+ # frame_4_label = gr.Label("Frame 4")
+ output = gr.Image(label="", elem_id='output')
+
+ submit_btn.click(fn=predict,
+ inputs=[caption_1, caption_2, caption_3, caption_4, source, top_k, top_p, n_candidates,
+ supercondition], outputs=output)
+
+ gr.Markdown('''
+ ### References
+
+ \[1\] Maharana, Adyasha, et al. "StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation." ECCV. 2022.
+
+ \[2\] Li, Yitong, et al. "Storygan: A sequential conditional gan for story visualization." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
+
+ \[3\] Kim, Kyung-Min, et al. "DeepStory: video story QA by deep embedded memory networks." Proceedings of the 26th International Joint Conference on Artificial Intelligence. 2017.
+
+ \[4\] Sharma, Piyush, et al. "Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018.
+ ''')
+
+ demo.launch(share=True)
+
+
+if __name__ == "__main__":
+ args_list = ['--model_name_or_path', './ckpt/25.pth',
+ '--prefix_model_name_or_path', './1.3B/',
+ '--dataset_name', 'pororo',
+ '--tuning_mode', 'story',
+ '--preseqlen', '32',
+ '--condition',
+ '--story_len', '4',
+ '--sent_embed', '512',
+ '--prefix_dropout', '0.2',
+ '--data_dir', '/playpen-ssd/adyasha/projects/StoryGAN/pororo_png/',
+ '--dataloader_num_workers', '1',
+ '--do_eval',
+ '--per_gpu_eval_batch_size', '16',
+ '--mode', 'story']
+
+ parser = argparse.ArgumentParser(description='arguments for training/evaluating prefix-tuning DALLE')
+
+ # Model Arguments
+ parser.add_argument('--model_name_or_path', type=str, default=None,
+ help='The model checkpoint for weights initialization.')
+ parser.add_argument('--prefix_model_name_or_path', type=str, default=None,
+ help='The prefix model checkpoint for weights initialization.')
+ parser.add_argument('--prefix_mode', type=str, default='activation', help='activation or embedding')
+ parser.add_argument('--preseqlen', type=int, default=0, help='how many tokens of prefix should we include.')
+ parser.add_argument('--optim_prefix', action="store_true",
+ help='set to True if optimizing prefix directly; no if through amortized function')
+    parser.add_argument('--tuning_mode', type=str, default='prefixtune', help='prefixtune, finetune or story')
+ parser.add_argument('--top_k_layers', type=int, default=2,
+ help='In finetuning setting, if we only tune the top k layers.')
+ parser.add_argument('--parameterize_mode', type=str, default='mlp',
+ help="mlp or emb to parametrize when we optimize for the embeddings.")
+ parser.add_argument('--prefix_dropout', type=float, default=0.0, help='dropout rate for the prefix tuning model.')
+ parser.add_argument('--teacher_dropout', type=float, default=0.0, help='dropout rate for the teacher model.')
+ parser.add_argument('--init_random', action="store_true", help="set True if initializing random embeddings")
+ parser.add_argument('--init_shallow', action="store_true", help="set True if not using reparameterization")
+ parser.add_argument('--init_shallow_word', type=bool, default=False,
+ help="set True if init_shallow and specify words")
+ parser.add_argument('--replay_buffer', action="store_true", help="set True if using replay buffer in training")
+ parser.add_argument('--gumbel', action="store_true", help="set True if using the gumbel softmax in training")
+    parser.add_argument('--hidden_dim_prefix', type=int, default=512, help="hidden dim of the MLP used to generate the prefix")
+
+ # Data Arguments
+ parser.add_argument('--dataset_name', type=str, default='pororo', help="dataset name")
+ parser.add_argument('--data_dir', type=str, default=None, help="Path to data directory")
+ parser.add_argument('--lowdata_token', type=str, default='story',
+ help="The token to be prepended at initialization time.")
+ parser.add_argument('--use_lowdata_token', type=bool, default=True,
+ help="Whether we should use the lowdata token for prefix-tuning")
+ parser.add_argument('--train_embeddings', action="store_true", help="Whether to train word embeddings")
+ parser.add_argument('--train_max_target_length', type=int, default=100,
+ help='the max target length for training data.')
+ parser.add_argument('--val_max_target_length', type=int, default=100, help='the max target length for dev data.')
+ parser.add_argument('--dataloader_num_workers', type=int, default=8, help='number of workers when loading data')
+
+ # new arguments for story
+ parser.add_argument('--prompt', action="store_true", help="set True if using prompts in StoryDALLE")
+    parser.add_argument('--story_len', type=int, default=4, help='number of frames (captions) in each story.')
+    parser.add_argument('--sent_embed', type=int, default=384, help='dimensionality of the caption (sentence) embeddings.')
+    parser.add_argument('--condition', action="store_true", help="set True to condition generation on the source frame via cross-attention")
+    parser.add_argument('--clip_embed', action="store_true", help="set True to use CLIP text embeddings for the captions")
+
+ # Training Arguments
+ parser.add_argument('--output_dir', type=str, default=None, help="Path to data directory")
+ parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
+ parser.add_argument("--do_eval", action="store_true", help="Whether to run evaluation.")
+ parser.add_argument("--do_test", action="store_true", help="Whether to run test.")
+ parser.add_argument('--seed', type=int, default=42, help='seed for reproducibility')
+ parser.add_argument("--overwrite_output_dir", action="store_true", help="Whether to overwrite output dir.")
+ parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.")
+ parser.add_argument(
+ "--per_gpu_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for evaluation."
+ )
+ parser.add_argument(
+ "--gradient_accumulation_steps",
+ type=int,
+ default=1,
+ help="Number of updates steps to accumulate before performing a backward/update pass.",
+ )
+
+    parser.add_argument('--mode', type=str, default='val', help="val or test.")
+
+ parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
+    parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
+ parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.")
+ parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
+ parser.add_argument(
+ "--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform."
+ )
+ parser.add_argument(
+ "--max_steps",
+ default=-1,
+ type=int,
+ help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
+ )
+ parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
+ parser.add_argument("--logging_steps", type=int, default=50, help="Log every X updates steps.")
+ parser.add_argument("--save_steps", type=int, default=50, help="Save checkpoint every X updates steps.")
+ parser.add_argument(
+ "--eval_all_checkpoints",
+ action="store_true",
+        help="Evaluate all checkpoints starting with the same prefix as model_name and ending with the step number",
+ )
+ parser.add_argument("--no_cuda", action="store_true", help="Avoid using CUDA when available")
+ parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank")
+ parser.add_argument(
+ "--fp16",
+ action="store_true",
+ help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit",
+ )
+
+ parser.add_argument("--debug", action="store_true", help="Whether to debug the demo.")
+
+ args = parser.parse_args(args_list)
+
+ main(args)
+
+
+
+
+
diff --git a/dalle/__init__.py b/dalle/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/dalle/__pycache__/__init__.cpython-38.pyc b/dalle/__pycache__/__init__.cpython-38.pyc
new file mode 100644
index 0000000000000000000000000000000000000000..34bc3bf5d4235efc211df0cad50d4941d93a1d5d
Binary files /dev/null and b/dalle/__pycache__/__init__.cpython-38.pyc differ
diff --git a/dalle/__pycache__/trainer_prefix.cpython-38.pyc b/dalle/__pycache__/trainer_prefix.cpython-38.pyc
new file mode 100644
index 0000000000000000000000000000000000000000..f9c94501d621c92f4a945009143c8ea29bd1cc5e
Binary files /dev/null and b/dalle/__pycache__/trainer_prefix.cpython-38.pyc differ
diff --git a/dalle/models/__init__.py b/dalle/models/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..3dee465bf5b7ac96b055d00d8f1aa6918e86a24f
--- /dev/null
+++ b/dalle/models/__init__.py
@@ -0,0 +1,1462 @@
+# ------------------------------------------------------------------------------------
+# Minimal DALL-E
+# Copyright (c) 2021 KakaoBrain. All Rights Reserved.
+# Licensed under the Apache License, Version 2.0 [see LICENSE for details]
+# ------------------------------------------------------------------------------------
+
+import os
+import torch
+import torch.nn as nn
+import pytorch_lightning as pl
+from typing import Optional, Tuple, Union
+from omegaconf import OmegaConf
+from torch.cuda.amp import autocast
+from torch.optim.lr_scheduler import CosineAnnealingLR, LambdaLR
+from torch.nn import functional as F
+from .stage1.vqgan import VQGAN
+from .stage2.transformer import Transformer1d, iGPT
+from .stage2.layers import Block
+from .. import utils
+from ..utils.config import get_base_config
+from ..utils.sampling import sampling, sampling_igpt, get_positional_encoding, sampling_prefix, sampling_conditional
+from ..utils.utils import save_image
+from .tokenizer import build_tokenizer
+import numpy as np
+from .stage2.layers import CrossAttentionLayer
+
+_MODELS = {
+ 'minDALL-E/1.3B': 'https://arena.kakaocdn.net/brainrepo/models/minDALL-E/57b008f02ceaa02b779c8b7463143315/1.3B.tar.gz'
+}
+
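+# The pipeline is two-stage: a VQGAN (stage 1) maps 256x256 images to a 16x16 grid of
+# discrete codes, and an autoregressive transformer (stage 2) models those codes
+# conditioned on the tokenized text.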
+class Dalle(pl.LightningModule):
+ def __init__(self,
+ config: OmegaConf) -> None:
+ super().__init__()
+ self.tokenizer = None
+ self.stage1 = VQGAN(n_embed=config.stage1.n_embed,
+ embed_dim=config.stage1.embed_dim,
+ hparams=config.stage1.hparams)
+ self.stage2 = Transformer1d(vocab_size_txt=config.stage2.vocab_size_txt,
+ vocab_size_img=config.stage2.vocab_size_img,
+ hparams=config.stage2.hparams)
+ self.config = config
+ self.config_stage1 = config.stage1
+ self.config_stage2 = config.stage2
+ self.config_dataset = config.dataset
+
+ # # make the parameters in stage 1 not trainable
+ # self.stage1.eval()
+ # for p in self.stage1.parameters():
+ # p.requires_grad = False
+
+ @classmethod
+ def from_pretrained(cls, args) -> Tuple[nn.Module, OmegaConf]:
+
+ path = args.model_name_or_path
+ config_new = OmegaConf.load(os.path.join(path, 'config.yaml'))
+ if args.do_train:
+ config_base = get_base_config('finetuning')
+ config_update = OmegaConf.merge(config_base, config_new)
+ for key, val in vars(args).items():
+ if key in config_update.optimizer.keys():
+ OmegaConf.update(config_update, "optimizer.%s" % key, val, merge=False)
+ if key in config_update.experiment.keys():
+ OmegaConf.update(config_update, "experiment.%s" % key, val, merge=False)
+ else:
+ config_base = get_base_config('default')
+ config_update = OmegaConf.merge(config_base, config_new)
+
+ model = cls(config_update)
+ model.tokenizer = build_tokenizer(os.path.join(path, 'tokenizer'),
+ context_length=model.config_dataset.context_length,
+ lowercase=True,
+ dropout=None)
+
+ print("Loading models from checkpoint %s" % path)
+
+ if hasattr(args, 'dalle_path') and args.dalle_path and args.dalle_path.endswith('.pth'):
+ model.load_state_dict(torch.load(args.dalle_path)["model_state_dict"])
+ else:
+ model.stage1.from_ckpt(os.path.join(path, 'stage1_last.ckpt'))
+ model.stage2.from_ckpt(os.path.join(path, 'stage2_last.ckpt'))
+
+ return model, config_update
+
+
+ @torch.no_grad()
+ def sampling(self,
+ prompt: Union[str, torch.LongTensor],
+ top_k: int = 256,
+ top_p: Optional[float] = None,
+ softmax_temperature: float = 1.0,
+ num_candidates: int = 96,
+ device: str = 'cuda:0',
+ use_fp16: bool = True) -> torch.FloatTensor:
+ self.stage1.eval()
+ self.stage2.eval()
+
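+        # Encode the text prompt, replicate it for `num_candidates` parallel samples,
+        # autoregressively sample image codes, then decode the codes to pixels with stage 1.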
+ if type(prompt) == str:
+ tokens = self.tokenizer.encode(prompt)
+ tokens = torch.LongTensor(tokens.ids)
+ else:
+ tokens = prompt
+ tokens = torch.repeat_interleave(tokens.unsqueeze(0), num_candidates, dim=0)
+
+ # Check if the encoding works as intended
+ # print(self.tokenizer.decode_batch(tokens.tolist(), skip_special_tokens=True)[0])
+
+ tokens = tokens.to(device)
+ codes = sampling(self.stage2,
+ tokens,
+ top_k=top_k,
+ top_p=top_p,
+ softmax_temperature=softmax_temperature,
+ use_fp16=use_fp16)
+ codes = codes.view(num_candidates, 16, 16) # [B, 16, 16]
+ pixels = torch.clamp(self.stage1.decode_code(codes) * 0.5 + 0.5, 0, 1) # [B, 256, 256]
+ return pixels
+
+ def forward(self,
+ images: torch.FloatTensor,
+ texts: Optional[torch.LongTensor],
+ past=None
+ ) -> tuple:
+ B, C, H, W = images.shape
+ with torch.no_grad():
+ with autocast(enabled=False):
+ codes = self.stage1.get_codes(images).detach()
+ pos_enc_tokens = get_positional_encoding(texts, mode='1d')
+ codes = codes.clone().detach()
+ pos_enc_code = get_positional_encoding(codes, mode='1d')
+ # codes = codes.unsqueeze(-1)
+ # pos_enc_code = pos_enc_code.unsqueeze(-1)
+ logits_img, logits_txt = self.stage2(codes, texts, pos_enc_code, pos_enc_tokens, past)
+ return logits_img, logits_txt, codes
+
+ def training_step(self, batch, batch_idx):
+ images, texts = batch
+ logits_img, logits_txt, codes = self(images, texts)
+
+ loss_img = F.cross_entropy(logits_img.view(-1, logits_img.shape[-1]), codes.view(-1))
+ loss_txt = F.cross_entropy(logits_txt.view(-1, logits_txt.shape[-1]), texts[:, 1:].reshape(-1))
+ self.log("train/loss_img", loss_img, on_step=True, on_epoch=True, prog_bar=False, logger=True)
+ self.log("train/loss_txt", loss_txt, on_step=True, on_epoch=True, prog_bar=False, logger=True)
+ return loss_img + loss_txt
+
+ def validation_step(self, batch, batch_idx):
+ images, texts = batch
+ logits_img, logits_txt, codes = self(images, texts)
+ # print(logits_img.shape, logits_txt.shape, codes.shape, texts.shape)
+
+ loss_img = F.cross_entropy(logits_img.view(-1, logits_img.shape[-1]), codes.view(-1))
+ loss_txt = F.cross_entropy(logits_txt.view(-1, logits_txt.shape[-1]), texts[:, 1:].reshape(-1))
+ self.log("val/loss_img", loss_img, on_step=False, on_epoch=True, prog_bar=False, logger=True)
+ self.log("val/loss_txt", loss_txt, on_step=False, on_epoch=True, prog_bar=False, logger=True)
+ return loss_img + loss_txt
+
+ def configure_optimizers(self):
+ assert self.config.optimizer.opt_type == 'adamW'
+ # assert self.config.optimizer.sched_type == 'cosine'
+
+ opt = torch.optim.AdamW(self.parameters(),
+ lr=self.config.optimizer.learning_rate,
+ betas=self.config.optimizer.betas,
+ weight_decay=self.config.optimizer.weight_decay)
+ # sched = CosineAnnealingLR(opt,
+ # T_max=self.config.optimizer.max_steps,
+ # eta_min=self.config.optimizer.min_lr)
+
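+        # Linear decay of the learning rate from its initial value to zero over `max_steps`.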
+ def lr_lambda(current_step: int):
+ return max(
+ 0.0, float(self.config.optimizer.max_steps - current_step) / float(max(1, self.config.optimizer.max_steps))
+ )
+
+ sched = LambdaLR(opt, lr_lambda)
+ sched = {
+ 'scheduler': sched,
+ 'name': 'linear'
+ }
+ return [opt], [sched]
+
+ def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, optimizer_closure,
+ on_tpu=False, using_native_amp=False, using_lbfgs=False):
+ optimizer.step(closure=optimizer_closure)
+ self.lr_schedulers().step()
+ self.log("lr", self.lr_schedulers().get_last_lr()[0], on_step=True, on_epoch=False, prog_bar=True, logger=True)
+
+ def on_epoch_start(self):
+ self.stage1.eval()
+
+
+class ImageGPT(pl.LightningModule):
+ def __init__(self,
+ config: OmegaConf) -> None:
+ super().__init__()
+ self.stage1 = VQGAN(n_embed=config.stage1.n_embed,
+ embed_dim=config.stage1.embed_dim,
+ hparams=config.stage1.hparams)
+ self.stage2 = iGPT(vocab_size_img=config.stage2.vocab_size_img,
+ use_cls_cond=config.stage2.use_cls_cond,
+ hparams=config.stage2.hparams)
+ self.config = config
+ self.use_cls_cond = config.stage2.use_cls_cond
+
+ # make the parameters in stage 1 not trainable
+ self.stage1.eval()
+ for p in self.stage1.parameters():
+ p.requires_grad = False
+
+ @classmethod
+ def from_pretrained(cls,
+ path_upstream: str,
+ path_downstream: str) -> Tuple[nn.Module, OmegaConf]:
+ config_base = get_base_config(use_default=False)
+ config_down = OmegaConf.load(path_downstream)
+ config_down = OmegaConf.merge(config_base, config_down)
+
+ model = cls(config_down)
+ model.stage1.from_ckpt(os.path.join(path_upstream, 'stage1_last.ckpt'), strict=True)
+ model.stage2.from_ckpt(os.path.join(path_upstream, 'stage2_last.ckpt'), strict=False)
+ return model, config_down
+
+ def sample(self,
+ cls_idx: Optional[int] = None,
+ top_k: int = 256,
+ top_p: Optional[float] = None,
+ softmax_temperature: float = 1.0,
+ num_candidates: int = 16,
+ device: str = 'cuda:0',
+ use_fp16: bool = True,
+ is_tqdm: bool = True) -> torch.FloatTensor:
+ self.stage1.eval()
+ self.stage2.eval()
+
+ if cls_idx is None:
+ sos = self.stage2.sos.repeat(num_candidates, 1, 1)
+ else:
+ sos = torch.LongTensor([cls_idx]).to(device=device)
+ sos = sos.repeat(num_candidates)
+ sos = self.stage2.sos(sos).unsqueeze(1)
+
+ codes = sampling_igpt(self.stage2,
+ sos=sos,
+ top_k=top_k,
+ top_p=top_p,
+ softmax_temperature=softmax_temperature,
+ use_fp16=use_fp16,
+ is_tqdm=is_tqdm)
+ codes = codes.view(num_candidates, 16, 16) # [B, 16, 16]
+ pixels = torch.clamp(self.stage1.decode_code(codes) * 0.5 + 0.5, 0, 1) # [B, 256, 256]
+ return pixels
+
+ def forward(self,
+ images: torch.FloatTensor,
+ labels: Optional[torch.LongTensor] = None) -> torch.FloatTensor:
+ B, C, H, W = images.shape
+ with torch.no_grad():
+ with autocast(enabled=False):
+ codes = self.stage1.get_codes(images).detach()
+ logits = self.stage2(codes, labels)
+ return logits, codes
+
+ def training_step(self, batch, batch_idx):
+ images, labels = batch
+ logits, codes = self(images, labels=labels if self.use_cls_cond else None)
+ loss = F.cross_entropy(logits.view(-1, logits.shape[-1]), codes.view(-1))
+ self.log("train/loss", loss, on_step=True, on_epoch=True, prog_bar=False, logger=True)
+ return loss
+
+ def validation_step(self, batch, batch_idx):
+ images, labels = batch
+ logits, codes = self(images, labels=labels if self.use_cls_cond else None)
+ loss = F.cross_entropy(logits.view(-1, logits.shape[-1]), codes.view(-1))
+ self.log("val/loss", loss, on_step=False, on_epoch=True, prog_bar=False, logger=True)
+ return loss
+
+ def configure_optimizers(self):
+ assert self.config.optimizer.opt_type == 'adamW'
+ assert self.config.optimizer.sched_type == 'cosine'
+
+ opt = torch.optim.AdamW(self.parameters(),
+ lr=self.config.optimizer.base_lr,
+ betas=self.config.optimizer.betas,
+ weight_decay=self.config.optimizer.weight_decay)
+ sched = CosineAnnealingLR(opt,
+ T_max=self.config.optimizer.max_steps,
+ eta_min=self.config.optimizer.min_lr)
+ sched = {
+ 'scheduler': sched,
+ 'name': 'cosine'
+ }
+ return [opt], [sched]
+
+ def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, optimizer_closure,
+ on_tpu=False, using_native_amp=False, using_lbfgs=False):
+ optimizer.step(closure=optimizer_closure)
+ self.lr_schedulers().step()
+ self.log("lr", self.lr_schedulers().get_last_lr()[0], on_step=True, on_epoch=False, prog_bar=True, logger=True)
+
+ def on_epoch_start(self):
+ self.stage1.eval()
+
+
+class PromptDalle(Dalle):
+    """DALL-E with prompt tuning: learned virtual prompt tokens are passed to the stage-2 transformer."""
+ def __init__(self, config):
+ super().__init__(config)
+ print('Initializing the PromptTuning model')
+
+ self.config = config
+ self.n_embd = config.stage2.hparams.embed_dim
+ self.preseqlen = config.prompt.preseqlen
+ self.prefix_dropout = config.prompt.prefix_dropout
+
+ # DIFFERENT PARAMETRIZATION:
+
+ print('[Full prompt-tuning Setting :) ]')
+ self.input_tokens = torch.arange(self.preseqlen).long()
+ self.wte = nn.Embedding(self.preseqlen, self.n_embd)
+ self.control_trans = nn.Sequential(
+ nn.Linear(self.n_embd, self.n_embd),
+ nn.Tanh(),
+ nn.Linear(self.n_embd, self.n_embd))
+ self.get_prompt = self.get_prompt_p5
+ self.dropout = nn.Dropout(self.prefix_dropout)
+
+ ###### NUM PARAMS #########
+ total_param = 0
+ for name, param in self.named_parameters():
+ # print(param.shape)
+ total_param += param.numel()
+ print('Total parameters is {}'.format(total_param))
+
+
+ @classmethod
+ def from_pretrained(cls, args) -> Tuple[nn.Module, OmegaConf]:
+
+ # if not args.model_name_or_path:
+ # args.model_name_or_path = args.prefix_model_name_or_path
+
+ path = args.prefix_model_name_or_path
+ path = _MODELS[path] if path in _MODELS else path
+ path = utils.realpath_url_or_path(path, root=os.path.expanduser("~/.cache/minDALL-E"))
+
+ config_base = get_base_config('prompt_tuning')
+ config_new = OmegaConf.load(os.path.join(path, 'config.yaml'))
+ config_update = OmegaConf.merge(config_base, config_new)
+
+ for key, val in vars(args).items():
+ if key in config_update.prompt.keys():
+ OmegaConf.update(config_update, "prompt.%s" % key, val, merge=False)
+ if key in config_update.optimizer.keys():
+ OmegaConf.update(config_update, "optimizer.%s" % key, val, merge=False)
+ if key in config_update.experiment.keys():
+ OmegaConf.update(config_update, "experiment.%s" % key, val, merge=False)
+
+ model = cls(config_update)
+ model.tokenizer = build_tokenizer(os.path.join(path, 'tokenizer'),
+ context_length=model.config_dataset.context_length,
+ lowercase=True,
+ dropout=None)
+
+ if args.model_name_or_path:
+ print("Loading model from pretrained checkpoint %s" % args.model_name_or_path)
+ # model.from_ckpt(args.model_name_or_path)
+ try:
+ model.load_state_dict(torch.load(args.model_name_or_path)['state_dict'])
+ except KeyError:
+ model.load_state_dict(torch.load(args.model_name_or_path)['model_state_dict'])
+
+ else:
+ print("Loading models from checkpoint %s" % path)
+ model.stage1.from_ckpt(os.path.join(path, 'stage1_last.ckpt'))
+ model.stage2.from_ckpt(os.path.join(path, 'stage2_last.ckpt'))
+
+ return model, config_update
+
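+    # Prompt tuning: a table of `preseqlen` learned virtual-token embeddings is
+    # reparameterized through a small MLP and passed to stage 2 as extra `prompt` tokens.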
+ def get_prompt_p5(self, bsz=None, eval=False):
+ input_tokens = self.input_tokens.unsqueeze(0).expand(bsz, -1).to(self.device)
+ temp_control = self.wte(input_tokens)
+ past_key_values = self.control_trans(temp_control) #bsz, seqlen, layer*emb
+ if not eval:
+ past_key_values = self.dropout(past_key_values)
+ return past_key_values
+
+ def forward(self,
+ images: torch.FloatTensor,
+ texts: Optional[torch.LongTensor],
+ **kwargs,
+ ):
+
+ #{"input_ids": batch, "labels": labels, 'src_attn': src_attn, 'tgt_attn':tgt_attn, 'src':src}
+
+ B, C, H, W = images.shape
+ prompt = self.get_prompt(bsz=B)
+ pos_enc_prompt = get_positional_encoding(self.input_tokens.unsqueeze(0).expand(B, -1).to(self.device), mode='1d')
+
+ # if self.mode_para == 2 and src_attn is not None and tgt_attn is not None:
+ # attention_mask = torch.cat([src_attn, tgt_attn], dim=1)
+
+
+ with torch.no_grad():
+ with autocast(enabled=False):
+ codes = self.stage1.get_codes(images).detach()
+
+ pos_enc_tokens = get_positional_encoding(texts, mode='1d')
+ codes = codes.clone().detach()
+ pos_enc_code = get_positional_encoding(codes, mode='1d')
+ # codes = codes.unsqueeze(-1)
+ # pos_enc_code = pos_enc_code.unsqueeze(-1)
+ # print(images.shape, codes.shape, texts.shape)
+ logits_img, logits_txt = self.stage2(codes, texts, pos_enc_code, pos_enc_tokens, prompt=prompt, pos_prompt=pos_enc_prompt)
+ return logits_img, logits_txt, codes
+
+
+ @torch.no_grad()
+ def sampling(self,
+ tokens: torch.LongTensor,
+ prompt: torch.FloatTensor,
+ top_k: int = 256,
+ top_p: Optional[float] = None,
+ softmax_temperature: float = 1.0,
+ num_candidates: int = 96,
+ device: str = 'cuda:0',
+ use_fp16: bool = True,
+ labels = None) -> torch.FloatTensor:
+ self.stage1.eval()
+ self.stage2.eval()
+
+ # tokens = torch.repeat_interleave(tokens.unsqueeze(0), num_candidates, dim=0)
+
+ tokens = tokens.to(device)
+ pos_enc_prompt = get_positional_encoding(self.input_tokens.unsqueeze(0).expand(num_candidates, -1).to(self.device), mode='1d')
+
+ codes = sampling(self.stage2,
+ tokens,
+ top_k=top_k,
+ top_p=top_p,
+ softmax_temperature=softmax_temperature,
+ use_fp16=use_fp16,
+ prompt=prompt,
+ pos_prompt=pos_enc_prompt)
+
+ codes = codes.view(-1, 16, 16) # [B, 16, 16]
+ pixels = torch.clamp(self.stage1.decode_code(codes) * 0.5 + 0.5, 0, 1) # [B, 256, 256]
+ return pixels
+
+
+ @torch.no_grad()
+ def predict_step(self, batch, batch_idx, return_images=False):
+ orig_images, texts = batch
+
+ # extra for checks
+ logits_img, logits_txt, codes = self(orig_images, texts)
+ pred = torch.argmax(logits_img.view(-1, logits_img.shape[-1]), dim=-1)
+ bs = orig_images.shape[0]
+ pred = pred.view(bs, 16, 16) # [B, 16, 16]
+ pixels = torch.clamp(self.stage1.decode_code(pred) * 0.5 + 0.5, 0, 1).cpu().numpy() # [B, 256, 256]
+ pixels = np.transpose(pixels, (0, 2, 3, 1))
+
+ # print(texts.shape, orig_images.shape)
+ prompt = self.get_prompt(bsz=5, eval=True)
+
+ images = []
+ for i, t in enumerate(texts):
+ pixels = self.sampling(t, prompt, top_k=16, num_candidates=5, labels=codes[i]).cpu().numpy()
+ pixels = np.transpose(pixels, (0, 2, 3, 1))
+ images.append(pixels)
+
+ if return_images:
+ return images
+ else:
+ save_image(orig_images, pixels, './out/images/pororo_prompt', batch_idx+10)
+ save_image(orig_images, images, './out/images/pororo_prompt', batch_idx)
+
+
+class PrefixTuningDalle(Dalle):
+    """DALL-E with prefix tuning: learned per-layer key/value prefixes are fed to the stage-2 transformer."""
+ def __init__(self, config):
+ super().__init__(config)
+ print('Initializing the PrefixTuning model')
+
+ self.config = config
+
+ self.match_n_layer = config.stage2.hparams.n_layers
+ self.match_n_head = config.stage2.hparams.n_heads
+ self.match_n_embd = config.stage2.hparams.embed_dim // config.stage2.hparams.n_heads
+ self.n_embd = config.stage2.hparams.embed_dim
+
+ self.optim_prefix = config.prefix.optim_prefix
+ self.preseqlen = config.prefix.preseqlen
+ self.prefix_dropout = config.prefix.prefix_dropout
+ self.init_random = config.prefix.init_random
+ self.hidden_dim_prefix = config.prefix.hidden_dim_prefix
+
+ self.lowdata_token = config.prefix.lowdata_token
+ self.init_shallow = config.prefix.init_shallow
+ self.init_shallow_word = config.prefix.init_shallow_word
+ self.mode_para = 0
+
+ print('PrefixTuning')
+ print('preseqlen is {}, optimizing the prefix directly'.format(self.preseqlen))
+
+ # DIFFERENT PARAMETRIZATION:
+
+ print('[Full prefix-tuning Setting :) ]')
+ self.input_tokens = torch.arange(self.preseqlen).long()
+ self.wte = nn.Embedding(self.preseqlen, self.n_embd)
+ self.control_trans = nn.Sequential(
+ nn.Linear(self.n_embd, self.hidden_dim_prefix),
+ nn.Tanh(),
+ nn.Linear(self.hidden_dim_prefix, self.match_n_layer * 2 * self.n_embd))
+ self.get_prompt = self.get_prompt_p5
+ self.dropout = nn.Dropout(self.prefix_dropout)
+
+ ###### NUM PARAMS #########
+ total_param = 0
+ for name, param in self.named_parameters():
+ # print(param.shape)
+ total_param += param.numel()
+ print('Total parameters is {}'.format(total_param))
+
+
+ @classmethod
+ def from_pretrained(cls, args) -> Tuple[nn.Module, OmegaConf]:
+
+ # if not args.model_name_or_path:
+ # args.model_name_or_path = args.prefix_model_name_or_path
+
+ path = args.prefix_model_name_or_path
+ path = _MODELS[path] if path in _MODELS else path
+ path = utils.realpath_url_or_path(path, root=os.path.expanduser("~/.cache/minDALL-E"))
+
+ config_base = get_base_config('prefixtuning')
+ config_new = OmegaConf.load(os.path.join(path, 'config.yaml'))
+ config_update = OmegaConf.merge(config_base, config_new)
+
+ for key, val in vars(args).items():
+ if key in config_update.prefix.keys():
+ OmegaConf.update(config_update, "prefix.%s" % key, val, merge=False)
+ if key in config_update.optimizer.keys():
+ OmegaConf.update(config_update, "optimizer.%s" % key, val, merge=False)
+ if key in config_update.experiment.keys():
+ OmegaConf.update(config_update, "experiment.%s" % key, val, merge=False)
+
+ model = cls(config_update)
+ model.tokenizer = build_tokenizer(os.path.join(path, 'tokenizer'),
+ context_length=model.config_dataset.context_length,
+ lowercase=True,
+ dropout=None)
+
+ if args.model_name_or_path:
+ print("Loading model from pretrained checkpoint %s" % args.model_name_or_path)
+ # model.from_ckpt(args.model_name_or_path)
+ try:
+ model.load_state_dict(torch.load(args.model_name_or_path)['state_dict'])
+ except KeyError:
+ model.load_state_dict(torch.load(args.model_name_or_path)['model_state_dict'])
+
+ else:
+ print("Loading models from checkpoint %s" % path)
+ model.stage1.from_ckpt(os.path.join(path, 'stage1_last.ckpt'))
+ model.stage2.from_ckpt(os.path.join(path, 'stage2_last.ckpt'))
+
+ return model, config_update
+
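+    # Prefix tuning: the MLP output is reshaped into per-layer key/value pairs of shape
+    # (n_layers * 2, bsz, n_heads, seqlen, head_dim) and fed to stage 2 as `past` activations.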
+ def get_prompt_p5(self, bsz=None, eval=False):
+ input_tokens = self.input_tokens.unsqueeze(0).expand(bsz, -1).to(self.device)
+ temp_control = self.wte(input_tokens)
+ past_key_values = self.control_trans(temp_control) #bsz, seqlen, layer*emb
+ bsz, seqlen, _ = past_key_values.shape
+ past_key_values = past_key_values.view(bsz, seqlen, self.match_n_layer * 2, self.match_n_head,
+ self.match_n_embd)
+ if not eval:
+ past_key_values = self.dropout(past_key_values)
+ # past_key_values = past_key_values.permute([2, 0, 3, 1, 4]).split(2)
+ past_key_values = past_key_values.permute([2, 0, 3, 1, 4])
+ # print(past_key_values.shape)
+ return past_key_values.split(2)
+
+ def forward(self,
+ images: torch.FloatTensor,
+ texts: Optional[torch.LongTensor],
+ **kwargs,
+ ):
+
+ #{"input_ids": batch, "labels": labels, 'src_attn': src_attn, 'tgt_attn':tgt_attn, 'src':src}
+
+ B, C, H, W = images.shape
+
+ if self.mode_para == 2:
+ past_key_values_prompt = self.get_prompt(bsz=B)
+ else:
+ past_key_values_prompt = self.get_prompt(bsz=B)
+
+ # if self.mode_para == 2 and src_attn is not None and tgt_attn is not None:
+ # attention_mask = torch.cat([src_attn, tgt_attn], dim=1)
+
+
+ with torch.no_grad():
+ with autocast(enabled=False):
+ codes = self.stage1.get_codes(images).detach()
+
+ pos_enc_tokens = get_positional_encoding(texts, mode='1d')
+ codes = codes.clone().detach()
+ pos_enc_code = get_positional_encoding(codes, mode='1d')
+ # codes = codes.unsqueeze(-1)
+ # pos_enc_code = pos_enc_code.unsqueeze(-1)
+ # print(images.shape, codes.shape, texts.shape)
+ logits_img, logits_txt = self.stage2(codes, texts, pos_enc_code, pos_enc_tokens, past_key_values_prompt)
+ return logits_img, logits_txt, codes
+
+ @torch.no_grad()
+ def sampling(self,
+ tokens: torch.LongTensor,
+ past: torch.FloatTensor,
+ top_k: int = 256,
+ top_p: Optional[float] = None,
+ softmax_temperature: float = 1.0,
+ num_candidates: int = 96,
+ device: str = 'cuda:0',
+ use_fp16: bool = True,
+ labels = None) -> torch.FloatTensor:
+ self.stage1.eval()
+ self.stage2.eval()
+
+ if len(past.shape) == 6:
+ n_layers, temp, bs, n_heads, seq_len, n_dim = past.shape
+ past = past.view(n_layers, temp, bs*n_heads, seq_len, n_dim)
+
+ tokens = torch.repeat_interleave(tokens.unsqueeze(0), num_candidates, dim=0)
+
+ # Check if the encoding works as intended
+ # print(self.tokenizer.decode_batch(tokens.tolist(), skip_special_tokens=True)[0])
+
+ tokens = tokens.to(device)
+ codes = sampling_prefix(self.stage2,
+ tokens,
+ past,
+ top_k=top_k,
+ top_p=top_p,
+ softmax_temperature=softmax_temperature,
+ use_fp16=use_fp16,
+ labels = None if labels is None else labels.view(-1))
+
+ # codes = sampling(self.stage2,
+ # tokens,
+ # top_k=top_k,
+ # top_p=top_p,
+ # softmax_temperature=softmax_temperature,
+ # use_fp16=use_fp16)
+
+ codes = codes.view(num_candidates, 16, 16) # [B, 16, 16]
+ pixels = torch.clamp(self.stage1.decode_code(codes) * 0.5 + 0.5, 0, 1) # [B, 256, 256]
+ return pixels
+
+ def training_step(self, batch, batch_idx):
+ images, texts = batch
+ logits_img, logits_txt, codes = self(images, texts)
+
+ loss_img = F.cross_entropy(logits_img.view(-1, logits_img.shape[-1]), codes.view(-1))
+ loss_txt = F.cross_entropy(logits_txt.view(-1, logits_txt.shape[-1]), texts[:, 1:].reshape(-1))
+ self.log("train/loss_img", loss_img, on_step=True, on_epoch=True, prog_bar=False, logger=True)
+ self.log("train/loss_txt", loss_txt, on_step=True, on_epoch=True, prog_bar=False, logger=True)
+ return loss_img + loss_txt
+
+ def validation_step(self, batch, batch_idx):
+ images, texts = batch
+ logits_img, logits_txt, codes = self(images, texts)
+ # print(logits_img.shape, logits_txt.shape, codes.shape, texts.shape)
+
+ loss_img = F.cross_entropy(logits_img.view(-1, logits_img.shape[-1]), codes.view(-1))
+ loss_txt = F.cross_entropy(logits_txt.view(-1, logits_txt.shape[-1]), texts[:, 1:].reshape(-1))
+ self.log("val/loss_img", loss_img, on_step=False, on_epoch=True, prog_bar=False, logger=True)
+ self.log("val/loss_txt", loss_txt, on_step=False, on_epoch=True, prog_bar=False, logger=True)
+ return loss_img + loss_txt
+
+ @torch.no_grad()
+ def predict_step(self, batch, batch_idx, return_images=False):
+ orig_images, texts = batch
+
+ # extra for checks
+ logits_img, logits_txt, codes = self(orig_images, texts)
+ pred = torch.argmax(logits_img.view(-1, logits_img.shape[-1]), dim=-1)
+ bs = orig_images.shape[0]
+ pred = pred.view(bs, 16, 16) # [B, 16, 16]
+ pixels = torch.clamp(self.stage1.decode_code(pred) * 0.5 + 0.5, 0, 1).cpu().numpy() # [B, 256, 256]
+ pixels = np.transpose(pixels, (0, 2, 3, 1))
+
+
+ # print(texts.shape, orig_images.shape)
+ # concatenate the list of prompts (split by n_head) for better downstream processing
+ past_key_values_prompt = self.get_prompt(bsz=5, eval=True)
+ # print(past_key_values_prompt[0].shape, past_key_values_prompt[1].shape, len(past_key_values_prompt))
+ past_key_values_prompt = torch.cat([x.unsqueeze(0) for x in past_key_values_prompt], dim=0)
+ n_layers, temp, bs, n_heads, seq_len, n_dim = past_key_values_prompt.shape
+ past_key_values_prompt = past_key_values_prompt.view(n_layers, temp, bs*n_heads, seq_len, n_dim)
+ # print(past_key_values_prompt.shape)
+ images = []
+ for i, t in enumerate(texts):
+ pixels = self.sampling(t, past_key_values_prompt, top_k=16, num_candidates=5, labels=codes[i]).cpu().numpy()
+ pixels = np.transpose(pixels, (0, 2, 3, 1))
+ images.append(pixels)
+ # images.extend([p for p in pixels])
+ # print([i.shape for i in images])
+
+
+ if return_images:
+ return images
+ else:
+ save_image(orig_images, pixels, './out/images/pororo_prefix', batch_idx+10)
+ save_image(orig_images, images, './out/images/pororo_prefix', batch_idx)
+
+
+class ConditionalDalle(Dalle):
+    """DALL-E conditioned on a source frame through cross-attention over its VQGAN codes."""
+ def __init__(self, config):
+ super().__init__(config)
+ print('Initializing the Conditional Dalle model')
+
+ self.config = config
+
+ print('Setting up Cross-attention Layers')
+ self.init_cross_attention(list(range(2,42,3)), config.stage2.hparams)
+
+ ###### NUM PARAMS #########
+ total_param = 0
+ for name, param in self.named_parameters():
+ # print(param.shape)
+ total_param += param.numel()
+ print('Total parameters is {}'.format(total_param))
+
+ @classmethod
+ def from_pretrained(cls, args) -> Tuple[nn.Module, OmegaConf]:
+
+ # if not args.model_name_or_path:
+ # args.model_name_or_path = args.prefix_model_name_or_path
+
+ path = args.model_name_or_path
+ config_new = OmegaConf.load(os.path.join(path, 'config.yaml'))
+ if args.do_train:
+ config_base = get_base_config('finetuning')
+ config_update = OmegaConf.merge(config_base, config_new)
+ for key, val in vars(args).items():
+ if key in config_update.optimizer.keys():
+ OmegaConf.update(config_update, "optimizer.%s" % key, val, merge=False)
+ if key in config_update.experiment.keys():
+ OmegaConf.update(config_update, "experiment.%s" % key, val, merge=False)
+ else:
+ config_base = get_base_config('default')
+ config_update = OmegaConf.merge(config_base, config_new)
+
+ model = cls(config_update)
+ model.tokenizer = build_tokenizer(os.path.join(path, 'tokenizer'),
+ context_length=model.config_dataset.context_length,
+ lowercase=True,
+ dropout=None)
+ print(model.cross_attention_idxs)
+ # print(next(model.cross_attention_layers[0].parameters()).is_cuda)
+
+ if args.dalle_path:
+ print("Loading model from pretrained checkpoint %s" % args.dalle_path)
+ # model.from_ckpt(args.model_name_or_path)
+ model.load_state_dict(torch.load(args.dalle_path)['model_state_dict'])
+ else:
+ print("Loading models from checkpoint %s" % path)
+ model.stage1.from_ckpt(os.path.join(path, 'stage1_last.ckpt'))
+ model.stage2.from_ckpt(os.path.join(path, 'stage2_last.ckpt'))
+
+ return model, config_update
+
+
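+    # Insert a cross-attention layer at every third transformer block (indices 2, 5, ..., 41)
+    # so that generation can attend to the VQGAN codes of the source frame.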
+ def init_cross_attention(self, cross_attention_layers, hparams):
+ self.cross_attention_idxs = cross_attention_layers
+ self.cross_attention_layers = [CrossAttentionLayer(ctx_len=hparams.ctx_len_img + hparams.ctx_len_txt,
+ embed_dim=hparams.embed_dim,
+ n_heads=hparams.n_heads,
+ attn_bias=hparams.attn_bias,
+ resid_pdrop=hparams.resid_pdrop,
+ attn_pdrop=hparams.attn_pdrop) for i in cross_attention_layers]
+
+
+ def forward(self,
+ images: torch.FloatTensor,
+ src_images: Optional[torch.FloatTensor],
+ texts: Optional[torch.LongTensor],
+ **kwargs,
+ ):
+
+ #{"input_ids": batch, "labels": labels, 'src_attn': src_attn, 'tgt_attn':tgt_attn, 'src':src}
+
+ # print(images.shape, src_images.shape, texts.shape)
+ with torch.no_grad():
+ with autocast(enabled=False):
+ codes = self.stage1.get_codes(images).detach()
+ src_codes = self.stage1.get_codes(src_images).detach()
+
+ pos_enc_tokens = get_positional_encoding(texts, mode='1d')
+ codes = codes.clone().detach()
+ pos_enc_code = get_positional_encoding(codes, mode='1d')
+ src_codes = src_codes.clone().detach()
+ src_pos_enc_code = get_positional_encoding(src_codes, mode='1d')
+ # codes = codes.unsqueeze(-1)
+ # pos_enc_code = pos_enc_code.unsqueeze(-1)
+ # print(images.shape, codes.shape, texts.shape)
+ logits_img, logits_txt = self.stage2.forward_with_context(codes, texts,
+ pos_enc_code, pos_enc_tokens, src_codes, src_pos_enc_code,
+ self.cross_attention_idxs, self.cross_attention_layers)
+ # print(logits_img.shape, logits_txt.shape, codes.shape, texts.shape)
+ return logits_img, logits_txt, codes
+
+ @torch.no_grad()
+ def sampling(self,
+ prompt: torch.LongTensor,
+ source: torch.FloatTensor,
+ top_k: int = 256,
+ top_p: Optional[float] = None,
+ softmax_temperature: float = 1.0,
+ num_candidates: int = 96,
+ device: str = 'cuda:0',
+ use_fp16: bool = True) -> torch.FloatTensor:
+ self.stage1.eval()
+ self.stage2.eval()
+
+ if type(prompt) == str:
+ tokens = self.tokenizer.encode(prompt)
+ tokens = torch.LongTensor(tokens.ids)
+ else:
+ tokens = prompt
+
+ tokens = torch.repeat_interleave(tokens.unsqueeze(0), num_candidates, dim=0)
+
+ # Check if the encoding works as intended
+ # print(self.tokenizer.decode_batch(tokens.tolist(), skip_special_tokens=True)[0])
+
+ tokens = tokens.to(device)
+ source = source.to(device)
+
+ with autocast(enabled=False):
+ src_codes = self.stage1.get_codes(source).detach()
+ src_codes = torch.repeat_interleave(src_codes, num_candidates, dim=0)
+
+ codes = sampling_conditional(self.stage2,
+ self.cross_attention_idxs,
+ self.cross_attention_layers,
+ tokens,
+ src_codes,
+ top_k=top_k,
+ top_p=top_p,
+ softmax_temperature=softmax_temperature,
+ use_fp16=use_fp16)
+ codes = codes.view(num_candidates, 16, 16) # [B, 16, 16]
+ pixels = torch.clamp(self.stage1.decode_code(codes) * 0.5 + 0.5, 0, 1) # [B, 256, 256]
+ return pixels
+
+ def training_step(self, batch, batch_idx):
+ images, texts = batch
+ logits_img, logits_txt, codes = self(images, texts)
+
+ loss_img = F.cross_entropy(logits_img.view(-1, logits_img.shape[-1]), codes.view(-1))
+ loss_txt = F.cross_entropy(logits_txt.view(-1, logits_txt.shape[-1]), texts[:, 1:].reshape(-1))
+ self.log("train/loss_img", loss_img, on_step=True, on_epoch=True, prog_bar=False, logger=True)
+ self.log("train/loss_txt", loss_txt, on_step=True, on_epoch=True, prog_bar=False, logger=True)
+ return loss_img + loss_txt
+
+ def validation_step(self, batch, batch_idx):
+ images, texts = batch
+ logits_img, logits_txt, codes = self(images, texts)
+ # print(logits_img.shape, logits_txt.shape, codes.shape, texts.shape)
+
+ loss_img = F.cross_entropy(logits_img.view(-1, logits_img.shape[-1]), codes.view(-1))
+ loss_txt = F.cross_entropy(logits_txt.view(-1, logits_txt.shape[-1]), texts[:, 1:].reshape(-1))
+ self.log("val/loss_img", loss_img, on_step=False, on_epoch=True, prog_bar=False, logger=True)
+ self.log("val/loss_txt", loss_txt, on_step=False, on_epoch=True, prog_bar=False, logger=True)
+ return loss_img + loss_txt
+
+ @torch.no_grad()
+ def predict_step(self, batch, batch_idx):
+ orig_images, texts = batch
+ # concatenate the list of prompts (split by n_head) for better downstream processing
+ past_key_values_prompt = self.get_prompt(bsz=5)
+ past_key_values_prompt = torch.cat([x.unsqueeze(0) for x in past_key_values_prompt], dim=0)
+ images = []
+ for t in texts:
+ pixels = self.sampling(t, past_key_values_prompt, top_k=64, num_candidates=5).cpu().numpy()
+ pixels = np.transpose(pixels, (0, 2, 3, 1))
+ images.append(pixels)
+ # images.extend([p for p in pixels])
+ # print([i.shape for i in images])
+
+ save_image(orig_images, images, './out/images/', batch_idx)
+
+
+class PromptConditionalDalle(Dalle):
+    """Prompt-tuned DALL-E conditioned on a source frame through cross-attention."""
+ def __init__(self, config):
+ super().__init__(config)
+ print('Initializing the Conditional Dalle model')
+
+ self.config = config
+
+ print('Setting up Cross-attention Layers')
+ self.init_cross_attention(list(range(2,42,3)), config.stage2.hparams)
+
+ self.n_embd = config.stage2.hparams.embed_dim
+ self.preseqlen = config.story.preseqlen
+ self.prefix_dropout = config.story.prefix_dropout
+
+ # DIFFERENT PARAMETRIZATION:
+
+ print('[Full prompt-tuning Setting :) ]')
+ self.input_tokens = torch.arange(self.preseqlen).long()
+ self.wte = nn.Embedding(self.preseqlen, self.n_embd)
+ self.control_trans = nn.Sequential(
+ nn.Linear(self.n_embd, self.n_embd),
+ nn.Tanh(),
+ nn.Linear(self.n_embd, self.n_embd))
+ self.get_prompt = self.get_prompt_p5
+ self.dropout = nn.Dropout(self.prefix_dropout)
+
+ ###### NUM PARAMS #########
+ total_param = 0
+ for name, param in self.named_parameters():
+ # print(param.shape)
+ total_param += param.numel()
+ print('Total parameters is {}'.format(total_param))
+
+ @classmethod
+ def from_pretrained(cls, args) -> Tuple[nn.Module, OmegaConf]:
+
+ # if not args.model_name_or_path:
+ # args.model_name_or_path = args.prefix_model_name_or_path
+
+ path = args.prefix_model_name_or_path
+ path = _MODELS[path] if path in _MODELS else path
+ path = utils.realpath_url_or_path(path, root=os.path.expanduser("~/.cache/minDALL-E"))
+
+ config_new = OmegaConf.load(os.path.join(path, 'config.yaml'))
+ if args.do_train:
+ config_base = get_base_config('story')
+ config_update = OmegaConf.merge(config_base, config_new)
+ for key, val in vars(args).items():
+ if key in config_update.story.keys():
+ OmegaConf.update(config_update, "story.%s" % key, val, merge=False)
+ if key in config_update.optimizer.keys():
+ OmegaConf.update(config_update, "optimizer.%s" % key, val, merge=False)
+ if key in config_update.experiment.keys():
+ OmegaConf.update(config_update, "experiment.%s" % key, val, merge=False)
+ else:
+ config_base = get_base_config('default')
+ config_update = OmegaConf.merge(config_base, config_new)
+
+ model = cls(config_update)
+ model.tokenizer = build_tokenizer(os.path.join(path, 'tokenizer'),
+ context_length=model.config_dataset.context_length,
+ lowercase=True,
+ dropout=None)
+ print(model.cross_attention_idxs)
+ # print(next(model.cross_attention_layers[0].parameters()).is_cuda)
+
+ if args.model_name_or_path:
+ print("Loading model from pretrained checkpoint %s" % args.model_name_or_path)
+ # model.from_ckpt(args.model_name_or_path)
+ try:
+ model.load_state_dict(torch.load(args.model_name_or_path)['state_dict'])
+ except KeyError:
+ model.load_state_dict(torch.load(args.model_name_or_path)['model_state_dict'])
+
+ else:
+ print("Loading models from checkpoint %s" % path)
+ model.stage1.from_ckpt(os.path.join(path, 'stage1_last.ckpt'))
+ model.stage2.from_ckpt(os.path.join(path, 'stage2_last.ckpt'))
+
+ return model, config_update
+
+
+ def init_cross_attention(self, cross_attention_layers, hparams):
+ self.cross_attention_idxs = cross_attention_layers
+ self.cross_attention_layers = [CrossAttentionLayer(ctx_len=hparams.ctx_len_img + hparams.ctx_len_txt,
+ embed_dim=hparams.embed_dim,
+ n_heads=hparams.n_heads,
+ attn_bias=hparams.attn_bias,
+ resid_pdrop=hparams.resid_pdrop,
+ attn_pdrop=hparams.attn_pdrop) for i in cross_attention_layers]
+
+ def get_prompt_p5(self, bsz=None, eval=False):
+ input_tokens = self.input_tokens.unsqueeze(0).expand(bsz, -1).to(self.device)
+ temp_control = self.wte(input_tokens)
+ past_key_values = self.control_trans(temp_control) #bsz, seqlen, layer*emb
+ if not eval:
+ past_key_values = self.dropout(past_key_values)
+ return past_key_values
+
+ def forward(self,
+ images: torch.FloatTensor,
+ src_images: Optional[torch.FloatTensor],
+ texts: Optional[torch.LongTensor],
+ **kwargs,
+ ):
+
+ #{"input_ids": batch, "labels": labels, 'src_attn': src_attn, 'tgt_attn':tgt_attn, 'src':src}
+
+ # print(images.shape, src_images.shape, texts.shape)
+ with torch.no_grad():
+ with autocast(enabled=False):
+ codes = self.stage1.get_codes(images).detach()
+ src_codes = self.stage1.get_codes(src_images).detach()
+
+ B, C, H, W = images.shape
+ prompt = self.get_prompt(bsz=B)
+ pos_enc_prompt = get_positional_encoding(self.input_tokens.unsqueeze(0).expand(B, -1).to(self.device), mode='1d')
+
+ pos_enc_tokens = get_positional_encoding(texts, mode='1d')
+ codes = codes.clone().detach()
+ pos_enc_code = get_positional_encoding(codes, mode='1d')
+ src_codes = src_codes.clone().detach()
+ src_pos_enc_code = get_positional_encoding(src_codes, mode='1d')
+ # codes = codes.unsqueeze(-1)
+ # pos_enc_code = pos_enc_code.unsqueeze(-1)
+ # print(images.shape, codes.shape, texts.shape)
+ logits_img, logits_txt = self.stage2.forward_with_context(codes, texts,
+ pos_enc_code, pos_enc_tokens, src_codes, src_pos_enc_code,
+ self.cross_attention_idxs, self.cross_attention_layers,
+ prompt=prompt, pos_prompt=pos_enc_prompt)
+ # print(logits_img.shape, logits_txt.shape, codes.shape, texts.shape)
+ return logits_img, logits_txt, codes
+
+ @torch.no_grad()
+ def sampling(self,
+ tokens: torch.LongTensor,
+ prompt: torch.LongTensor,
+ source: torch.FloatTensor,
+ top_k: int = 256,
+ top_p: Optional[float] = None,
+ softmax_temperature: float = 1.0,
+ num_candidates: int = 96,
+ device: str = 'cuda:0',
+ use_fp16: bool = True,
+ labels=None) -> torch.FloatTensor:
+
+ self.stage1.eval()
+ self.stage2.eval()
+
+ if type(tokens) == str:
+            tokens = self.tokenizer.encode(tokens)
+ tokens = torch.LongTensor(tokens.ids)
+ else:
+ pass
+
+ tokens = torch.repeat_interleave(tokens.unsqueeze(0), num_candidates, dim=0)
+
+ # Check if the encoding works as intended
+ # print(self.tokenizer.decode_batch(tokens.tolist(), skip_special_tokens=True)[0])
+
+ tokens = tokens.to(device)
+ source = source.to(device)
+
+ pos_enc_prompt = get_positional_encoding(self.input_tokens.unsqueeze(0).expand(num_candidates, -1).to(self.device), mode='1d')
+
+ with autocast(enabled=False):
+ src_codes = self.stage1.get_codes(source).detach()
+ src_codes = torch.repeat_interleave(src_codes, num_candidates, dim=0)
+
+ codes = sampling_conditional(self.stage2,
+ self.cross_attention_idxs,
+ self.cross_attention_layers,
+ tokens,
+ src_codes,
+ top_k=top_k,
+ top_p=top_p,
+ softmax_temperature=softmax_temperature,
+ use_fp16=use_fp16,
+ prompt=prompt,
+ pos_prompt=pos_enc_prompt)
+
+ codes = codes.view(num_candidates, 16, 16) # [B, 16, 16]
+ pixels = torch.clamp(self.stage1.decode_code(codes) * 0.5 + 0.5, 0, 1) # [B, 256, 256]
+ return pixels
+
+
+ @torch.no_grad()
+ def predict_step(self, batch, batch_idx, return_images=False):
+ orig_images, texts = batch
+ # concatenate the list of prompts (split by n_head) for better downstream processing
+
+ # extra for checks
+ logits_img, logits_txt, codes = self(orig_images, texts)
+ pred = torch.argmax(logits_img.view(-1, logits_img.shape[-1]), dim=-1)
+ bs = orig_images.shape[0]
+ pred = pred.view(bs, 16, 16) # [B, 16, 16]
+ pixels = torch.clamp(self.stage1.decode_code(pred) * 0.5 + 0.5, 0, 1).cpu().numpy() # [B, 256, 256]
+ pixels = np.transpose(pixels, (0, 2, 3, 1))
+
+ prompt = self.get_prompt(bsz=5, eval=True)
+
+ images = []
+        for i, t in enumerate(texts):
+ pixels = self.sampling(t, prompt, top_k=64, num_candidates=5, labels=codes[i]).cpu().numpy()
+ pixels = np.transpose(pixels, (0, 2, 3, 1))
+ images.append(pixels)
+ # images.extend([p for p in pixels])
+ # print([i.shape for i in images])
+
+ if return_images:
+ return images
+ else:
+ save_image(orig_images, pixels, './out/images/pororo_story', batch_idx+10)
+ save_image(orig_images, images, './out/images/pororo_story', batch_idx)
+
+
+class StoryDalle(Dalle):
+ """Base model with story block"""
+ def __init__(self, config):
+ super().__init__(config)
+        print('Initializing the StoryDalle model')
+
+ self.config = config
+
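+        # Story block: project per-caption sentence embeddings to the transformer width and
+        # let them attend across the `story_len` frames to share context between captions.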
+ self.story_linear = nn.Linear(config.story.sent_embed, config.stage2.hparams.embed_dim)
+ self.story_block = Block(ctx_len=config.story.story_len,
+ embed_dim=config.stage2.hparams.embed_dim,
+ n_heads=config.stage2.hparams.n_heads,
+ mlp_bias=config.stage2.hparams.mlp_bias,
+ attn_bias=config.stage2.hparams.attn_bias,
+ resid_pdrop=config.stage2.hparams.resid_pdrop,
+ attn_pdrop=config.stage2.hparams.attn_pdrop,
+ gelu_use_approx=config.stage2.hparams.gelu_use_approx)
+
+ if self.config.story.prompt:
+ self.n_embd = config.stage2.hparams.embed_dim
+ self.preseqlen = config.story.preseqlen
+ self.prefix_dropout = config.story.prefix_dropout
+
+ # DIFFERENT PARAMETRIZATION:
+
+ print('[Full prompt-tuning Setting :) ]')
+ self.input_tokens = torch.arange(self.preseqlen).long()
+ self.wte = nn.Embedding(self.preseqlen, self.n_embd)
+ self.control_trans = nn.Sequential(
+ nn.Linear(self.n_embd, self.n_embd),
+ nn.Tanh(),
+ nn.Linear(self.n_embd, self.n_embd))
+ self.get_prompt = self.get_prompt_p5
+ self.dropout = nn.Dropout(self.prefix_dropout)
+
+ if self.config.story.condition:
+ print('Setting up Cross-attention Layers')
+ self.init_cross_attention(list(range(2,42,3)), config.stage2.hparams)
+
+ ###### NUM PARAMS #########
+ total_param = 0
+ for name, param in self.named_parameters():
+ # print(param.shape)
+ total_param += param.numel()
+ print('Total parameters is {}'.format(total_param))
+
+ @classmethod
+ def from_pretrained(cls, args) -> Tuple[nn.Module, OmegaConf]:
+
+ # if not args.model_name_or_path:
+ # args.model_name_or_path = args.prefix_model_name_or_path
+
+ path = args.prefix_model_name_or_path
+ path = _MODELS[path] if path in _MODELS else path
+ path = utils.realpath_url_or_path(path, root=os.path.expanduser("~/.cache/minDALL-E"))
+
+ config_new = OmegaConf.load(os.path.join(path, 'config.yaml'))
+ # if args.do_train:
+ config_base = get_base_config('story')
+ config_update = OmegaConf.merge(config_base, config_new)
+ for key, val in vars(args).items():
+ if key in config_update.story.keys():
+ OmegaConf.update(config_update, "story.%s" % key, val, merge=False)
+ if key in config_update.optimizer.keys():
+ OmegaConf.update(config_update, "optimizer.%s" % key, val, merge=False)
+ if key in config_update.experiment.keys():
+ OmegaConf.update(config_update, "experiment.%s" % key, val, merge=False)
+ # else:
+ # config_base = get_base_config('story')
+ # config_update = OmegaConf.merge(config_base, config_new)
+ # print(next(model.cross_attention_layers[0].parameters()).is_cuda)
+
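+ # the fine-tuned checkpoints use an extended text vocabulary
+ # (9 extra tokens for Pororo, 7 for Flintstones)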
+ if args.model_name_or_path:
+ if 'pororo' in args.model_name_or_path:
+ config_update.stage2.vocab_size_txt = config_update.stage2.vocab_size_txt + 9
+ elif 'flintstones' in args.model_name_or_path:
+ config_update.stage2.vocab_size_txt = config_update.stage2.vocab_size_txt + 7
+ model = cls(config_update)
+ model_dir = os.path.dirname(args.model_name_or_path)
+ print(model_dir)
+ model.tokenizer = build_tokenizer(model_dir,
+ context_length=model.config_dataset.context_length,
+ lowercase=True,
+ dropout=None)
+ print("Loaded tokenizer from finetuned checkpoint")
+ print(model.cross_attention_idxs)
+ print("Loading model from pretrained checkpoint %s" % args.model_name_or_path)
+ # model.from_ckpt(args.model_name_or_path)
+ try:
+ model.load_state_dict(torch.load(args.model_name_or_path)['state_dict'])
+ except KeyError:
+ model.load_state_dict(torch.load(args.model_name_or_path)['model_state_dict'])
+ else:
+ model = cls(config_update)
+ print(model.cross_attention_idxs)
+ print("Loading models from checkpoint %s" % path)
+ model.stage1.from_ckpt(os.path.join(path, 'stage1_last.ckpt'))
+ model.stage2.from_ckpt(os.path.join(path, 'stage2_last.ckpt'))
+
+ model.tokenizer = build_tokenizer(os.path.join(path, 'tokenizer'),
+ context_length=model.config_dataset.context_length,
+ lowercase=True,
+ dropout=None)
+
+
+ return model, config_update
+
+
+ def init_cross_attention(self, cross_attention_layers, hparams):
+ self.cross_attention_idxs = cross_attention_layers
+ self.cross_attention_layers = [CrossAttentionLayer(ctx_len=hparams.ctx_len_img + hparams.ctx_len_txt,
+ embed_dim=hparams.embed_dim,
+ n_heads=hparams.n_heads,
+ attn_bias=hparams.attn_bias,
+ resid_pdrop=hparams.resid_pdrop,
+ attn_pdrop=hparams.attn_pdrop) for _ in cross_attention_layers]
+
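+ # Prompt generator: maps the learned prompt token ids through an embedding table and a small
+ # MLP (control_trans) to prefix embeddings of shape (bsz, preseqlen, embed_dim).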
+ def get_prompt_p5(self, bsz=None, eval=False):
+ input_tokens = self.input_tokens.unsqueeze(0).expand(bsz, -1).to(self.device)
+ temp_control = self.wte(input_tokens)
+ past_key_values = self.control_trans(temp_control) #bsz, seqlen, layer*emb
+ if not eval:
+ past_key_values = self.dropout(past_key_values)
+ return past_key_values
+
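+ # Training forward pass: flatten the story dimension into the batch, encode target and source
+ # frames into discrete stage-1 codes, build the (prompt + caption embedding) prefix, and run the
+ # stage-2 transformer, with cross-attention over the source-frame codes when condition is set.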
+ def forward(self,
+ images: torch.FloatTensor,
+ src_images: Optional[torch.FloatTensor],
+ texts: Optional[torch.LongTensor],
+ sent_embeds: Optional[torch.FloatTensor],
+ **kwargs,
+ ):
+
+ # print(images.shape, src_images.shape, texts.shape, sent_embeds.shape)
+
+ B, L, C, H, W = images.shape
+ images = images.view(B*L, C, H, W)
+ src_images = src_images.unsqueeze(1).expand(-1, L, -1, -1, -1).reshape(B*L, C, H, W)
+ sent_embeds = self.story_block(self.story_linear(sent_embeds)).view(B * L, -1).unsqueeze(1)
+ texts = texts.view(B * L, -1)
+
+ #{"input_ids": batch, "labels": labels, 'src_attn': src_attn, 'tgt_attn':tgt_attn, 'src':src}
+
+ with torch.no_grad():
+ with autocast(enabled=False):
+ codes = self.stage1.get_codes(images).detach()
+ src_codes = self.stage1.get_codes(src_images).detach()
+
+ B, C, H, W = images.shape
+
+ if self.config.story.prompt:
+ prompt = self.get_prompt(bsz=B)
+ prompt = torch.cat([prompt, sent_embeds], dim=1)
+ else:
+ prompt = sent_embeds
+
+ # dim = 0 for full-model finetuning??
+ pos_enc_prompt = get_positional_encoding(torch.arange(prompt.shape[1]).long().unsqueeze(0).expand(B, -1).to(self.device),
+ mode='1d')
+
+ pos_enc_tokens = get_positional_encoding(texts, mode='1d')
+ codes = codes.clone().detach()
+ pos_enc_code = get_positional_encoding(codes, mode='1d')
+ src_codes = src_codes.clone().detach()
+ src_pos_enc_code = get_positional_encoding(src_codes, mode='1d')
+ # codes = codes.unsqueeze(-1)
+ # pos_enc_code = pos_enc_code.unsqueeze(-1)
+ # print(images.shape, codes.shape, texts.shape)
+ if self.config.story.condition:
+ logits_img, logits_txt = self.stage2.forward_with_context(codes, texts,
+ pos_enc_code, pos_enc_tokens, src_codes, src_pos_enc_code,
+ self.cross_attention_idxs, self.cross_attention_layers,
+ prompt=prompt, pos_prompt=pos_enc_prompt)
+ else:
+ logits_img, logits_txt = self.stage2(codes, texts, pos_enc_code, pos_enc_tokens, prompt=prompt,
+ pos_prompt=pos_enc_prompt)
+
+ # print(logits_img.shape, logits_txt.shape, codes.shape, texts.shape)
+ return logits_img, logits_txt, codes
+
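+ # Autoregressive sampling of a single story (story_len frames); the source frame is encoded once
+ # and its codes are attended to via cross-attention when config.story.condition is set.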
+ @torch.no_grad()
+ def sampling(self,
+ tokens: torch.LongTensor,
+ source: torch.FloatTensor,
+ sent_embeds: torch.FloatTensor,
+ top_k: int = 256,
+ top_p: Optional[float] = None,
+ softmax_temperature: float = 1.0,
+ num_candidates: int = 96,
+ device: str = 'cuda:0',
+ use_fp16: bool = True,
+ labels=None,
+ prompt = None) -> torch.FloatTensor:
+
+ self.stage1.eval()
+ self.stage2.eval()
+
+ if isinstance(tokens, str):
+ tokens = self.tokenizer.encode(tokens)
+ tokens = torch.LongTensor(tokens.ids)
+
+ # tokens = torch.repeat_interleave(tokens.unsqueeze(0), num_candidates, dim=0)
+
+ # Check if the encoding works as intended
+ # print(self.tokenizer.decode_batch(tokens.tolist(), skip_special_tokens=True)[0])
+
+ tokens = tokens.to(device)
+ source = source.to(device)
+
+ # print(tokens.shape, sent_embeds.shape, prompt.shape)
+ B, L, _ = sent_embeds.shape
+ sent_embeds = self.story_block(self.story_linear(sent_embeds)).view(B * L, -1).unsqueeze(1)
+ if prompt is not None:
+ prompt = torch.cat([prompt, sent_embeds], dim=1)
+ else:
+ prompt = sent_embeds
+ pos_enc_prompt = get_positional_encoding(torch.arange(prompt.shape[1]).long().unsqueeze(0).expand(B*L, -1).to(self.device), mode='1d')
+
+ with autocast(enabled=False):
+ src_codes = self.stage1.get_codes(source).detach()
+ src_codes = torch.repeat_interleave(src_codes, self.config.story.story_len, dim=0)
+ print(tokens.shape, src_codes.shape, prompt.shape)
+ if self.config.story.condition:
+ codes = sampling_conditional(self.stage2,
+ self.cross_attention_idxs,
+ self.cross_attention_layers,
+ tokens,
+ src_codes,
+ top_k=top_k,
+ top_p=top_p,
+ softmax_temperature=softmax_temperature,
+ use_fp16=use_fp16,
+ prompt=prompt,
+ pos_prompt=pos_enc_prompt)
+ else:
+ codes = sampling(self.stage2,
+ tokens,
+ top_k=top_k,
+ top_p=top_p,
+ softmax_temperature=softmax_temperature,
+ use_fp16=use_fp16,
+ prompt=prompt,
+ pos_prompt=pos_enc_prompt)
+
+ codes = codes.view(self.config.story.story_len, 16, 16) # [story_len, 16, 16]
+ pixels = torch.clamp(self.stage1.decode_code(codes) * 0.5 + 0.5, 0, 1) # [story_len, 3, 256, 256]
+ return pixels
+
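+ # Batched variant of sampling: inputs are tiled n_candidates times so that several candidate
+ # stories can be decoded in one pass; the output is reshaped to (n_candidates, story_len, C, H, W).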
+ @torch.no_grad()
+ def sampling_batch(self,
+ tokens: torch.LongTensor,
+ source: torch.FloatTensor,
+ sent_embeds: torch.FloatTensor,
+ top_k: int = 256,
+ top_p: Optional[float] = None,
+ softmax_temperature: float = 1.0,
+ num_candidates: int = 96,
+ device: str = 'cuda:0',
+ use_fp16: bool = True,
+ labels=None,
+ prompt=None, n_candidates=1) -> torch.FloatTensor:
+
+ self.stage1.eval()
+ self.stage2.eval()
+
+ if isinstance(tokens, str):
+ tokens = self.tokenizer.encode(tokens)
+ tokens = torch.LongTensor(tokens.ids)
+
+ # tokens = torch.repeat_interleave(tokens.unsqueeze(0), num_candidates, dim=0)
+
+ # Check if the encoding works as intended
+ # print(self.tokenizer.decode_batch(tokens.tolist(), skip_special_tokens=True)[0])
+
+ tokens = tokens.to(device)
+ source = source.to(device)
+
+ # print(tokens.shape, sent_embeds.shape, prompt.shape)
+ B, L, _ = sent_embeds.shape
+ sent_embeds = self.story_block(self.story_linear(sent_embeds)).view(B * L, -1).unsqueeze(1)
+ if prompt is not None:
+ prompt = torch.cat([prompt, sent_embeds], dim=1)
+ else:
+ prompt = sent_embeds
+ pos_enc_prompt = get_positional_encoding(
+ torch.arange(prompt.shape[1]).long().unsqueeze(0).expand(B * L, -1).to(self.device), mode='1d')
+
+ with autocast(enabled=False):
+ src_codes = self.stage1.get_codes(source).detach()
+
+ # repeat inputs to adjust to n_candidates and story length
+ src_codes = torch.repeat_interleave(src_codes, self.config.story.story_len * n_candidates, dim=0)
+ prompt = prompt.repeat(n_candidates, 1, 1)
+ pos_enc_prompt = pos_enc_prompt.repeat(n_candidates, 1)
+ tokens = tokens.repeat(n_candidates, 1)
+ print(tokens.shape, src_codes.shape, prompt.shape, pos_enc_prompt.shape)
+ if self.config.story.condition:
+ codes = sampling_conditional(self.stage2,
+ self.cross_attention_idxs,
+ self.cross_attention_layers,
+ tokens,
+ src_codes,
+ top_k=top_k,
+ top_p=top_p,
+ softmax_temperature=softmax_temperature,
+ use_fp16=use_fp16,
+ prompt=prompt,
+ pos_prompt=pos_enc_prompt)
+ else:
+ codes = sampling(self.stage2,
+ tokens,
+ top_k=top_k,
+ top_p=top_p,
+ softmax_temperature=softmax_temperature,
+ use_fp16=use_fp16,
+ prompt=prompt,
+ pos_prompt=pos_enc_prompt)
+
+ codes = codes.view(self.config.story.story_len * n_candidates, 16, 16) # [story_len * n_candidates, 16, 16]
+ print(codes.shape)
+ pixels = torch.clamp(self.stage1.decode_code(codes) * 0.5 + 0.5, 0, 1) # [B, 3, 256, 256]
+ print(pixels.shape)
+ return pixels.view(n_candidates, self.config.story.story_len, pixels.shape[-3], pixels.shape[-2], pixels.shape[-1])
+
+
+ @torch.no_grad()
+ def predict_step(self, batch, batch_idx, return_images=False):
+ orig_images, texts = batch
+ # concatenate the list of prompts (split by n_head) for better downstream processing
+
+ # extra for checks
+ logits_img, logits_txt, codes = self(orig_images, texts)
+ pred = torch.argmax(logits_img.view(-1, logits_img.shape[-1]), dim=-1)
+ bs = orig_images.shape[0]
+ pred = pred.view(bs, 16, 16) # [B, 16, 16]
+ pixels = torch.clamp(self.stage1.decode_code(pred) * 0.5 + 0.5, 0, 1).cpu().numpy() # [B, 3, 256, 256]
+ pixels = np.transpose(pixels, (0, 2, 3, 1))
+
+ prompt = self.get_prompt(bsz=5, eval=True)
+
+ images = []
+ # iterate with an explicit index so that codes[i] matches the caption being sampled
+ for i, t in enumerate(texts):
+ sampled = self.sampling(t, prompt, top_k=64, num_candidates=5, labels=codes[i]).cpu().numpy()
+ sampled = np.transpose(sampled, (0, 2, 3, 1))
+ images.append(sampled)
+ # images.extend([p for p in pixels])
+ # print([i.shape for i in images])
+
+ if return_images:
+ return images
+ else:
+ save_image(orig_images, pixels, './out/images/pororo_story', batch_idx+10)
+ save_image(orig_images, images, './out/images/pororo_story', batch_idx)
diff --git a/dalle/models/stage1/layers.py b/dalle/models/stage1/layers.py
new file mode 100644
index 0000000000000000000000000000000000000000..16c758c98089b6278190b7b52479df0eed941d9f
--- /dev/null
+++ b/dalle/models/stage1/layers.py
@@ -0,0 +1,373 @@
+# ------------------------------------------------------------------------------------
+# Modified from VQGAN (https://github.com/CompVis/taming-transformers)
+# Copyright (c) 2020 Patrick Esser and Robin Rombach and Björn Ommer. All Rights Reserved.
+# ------------------------------------------------------------------------------------
+
+import torch
+import torch.nn as nn
+from typing import Tuple, Optional
+
+
+def nonlinearity(x):
+ # swish
+ return x*torch.sigmoid(x)
+
+
+def Normalize(in_channels):
+ return torch.nn.GroupNorm(num_groups=32,
+ num_channels=in_channels,
+ eps=1e-6,
+ affine=True)
+
+
+class Upsample(nn.Module):
+ def __init__(self, in_channels, with_conv):
+ super().__init__()
+ self.with_conv = with_conv
+ if self.with_conv:
+ self.conv = torch.nn.Conv2d(in_channels,
+ in_channels,
+ kernel_size=3,
+ stride=1,
+ padding=1)
+
+ def forward(self, x):
+ x = torch.nn.functional.interpolate(x, scale_factor=2.0, mode="nearest")
+ if self.with_conv:
+ x = self.conv(x)
+ return x
+
+
+class Downsample(nn.Module):
+ def __init__(self, in_channels, with_conv):
+ super().__init__()
+ self.with_conv = with_conv
+ if self.with_conv:
+ # no asymmetric padding in torch conv, must do it ourselves
+ self.conv = torch.nn.Conv2d(in_channels,
+ in_channels,
+ kernel_size=3,
+ stride=2,
+ padding=0)
+
+ def forward(self, x):
+ if self.with_conv:
+ pad = (0, 1, 0, 1)
+ x = torch.nn.functional.pad(x, pad, mode="constant", value=0)
+ x = self.conv(x)
+ else:
+ x = torch.nn.functional.avg_pool2d(x, kernel_size=2, stride=2)
+ return x
+
+
+class ResnetBlock(nn.Module):
+ def __init__(self, *, in_channels, out_channels=None, conv_shortcut=False,
+ dropout, temb_channels=512):
+ assert temb_channels == 0
+ super().__init__()
+ self.in_channels = in_channels
+ out_channels = in_channels if out_channels is None else out_channels
+ self.out_channels = out_channels
+ self.use_conv_shortcut = conv_shortcut
+
+ self.norm1 = Normalize(in_channels)
+ self.conv1 = torch.nn.Conv2d(in_channels,
+ out_channels,
+ kernel_size=3,
+ stride=1,
+ padding=1)
+ self.norm2 = Normalize(out_channels)
+ self.dropout = torch.nn.Dropout(dropout)
+ self.conv2 = torch.nn.Conv2d(out_channels,
+ out_channels,
+ kernel_size=3,
+ stride=1,
+ padding=1)
+ if self.in_channels != self.out_channels:
+ if self.use_conv_shortcut:
+ self.conv_shortcut = torch.nn.Conv2d(in_channels,
+ out_channels,
+ kernel_size=3,
+ stride=1,
+ padding=1)
+ else:
+ self.nin_shortcut = torch.nn.Conv2d(in_channels,
+ out_channels,
+ kernel_size=1,
+ stride=1,
+ padding=0)
+
+ def forward(self, x, temb=None):
+ assert temb is None
+
+ h = x
+ h = self.norm1(h)
+ h = nonlinearity(h)
+ h = self.conv1(h)
+
+ h = self.norm2(h)
+ h = nonlinearity(h)
+ h = self.dropout(h)
+ h = self.conv2(h)
+
+ if self.in_channels != self.out_channels:
+ if self.use_conv_shortcut:
+ x = self.conv_shortcut(x)
+ else:
+ x = self.nin_shortcut(x)
+ return x+h
+
+
+class AttnBlock(nn.Module):
+ def __init__(self, in_channels):
+ super().__init__()
+ self.in_channels = in_channels
+
+ self.norm = Normalize(in_channels)
+ self.q = torch.nn.Conv2d(in_channels,
+ in_channels,
+ kernel_size=1,
+ stride=1,
+ padding=0)
+ self.k = torch.nn.Conv2d(in_channels,
+ in_channels,
+ kernel_size=1,
+ stride=1,
+ padding=0)
+ self.v = torch.nn.Conv2d(in_channels,
+ in_channels,
+ kernel_size=1,
+ stride=1,
+ padding=0)
+ self.proj_out = torch.nn.Conv2d(in_channels,
+ in_channels,
+ kernel_size=1,
+ stride=1,
+ padding=0)
+
+ def forward(self, x):
+ h_ = x
+ h_ = self.norm(h_)
+ q = self.q(h_)
+ k = self.k(h_)
+ v = self.v(h_)
+
+ # compute attention
+ b, c, h, w = q.shape
+ q = q.reshape(b, c, h*w)
+ q = q.permute(0, 2, 1) # b,hw,c
+ k = k.reshape(b, c, h*w) # b,c,hw
+ w_ = torch.bmm(q, k) # b,hw,hw w[b,i,j]=sum_c q[b,i,c]k[b,c,j]
+ w_ = w_ * (int(c)**(-0.5))
+ w_ = torch.nn.functional.softmax(w_, dim=2)
+
+ # attend to values
+ v = v.reshape(b, c, h*w)
+ w_ = w_.permute(0, 2, 1) # b,hw,hw (first hw of k, second of q)
+ h_ = torch.bmm(v, w_) # b, c,hw (hw of q) h_[b,c,j] = sum_i v[b,c,i] w_[b,i,j]
+ h_ = h_.reshape(b, c, h, w)
+
+ h_ = self.proj_out(h_)
+ return x+h_
+
+
+class Encoder(nn.Module):
+ def __init__(self,
+ *, # forced to use named arguments
+ ch: int,
+ out_ch: int,
+ ch_mult: Tuple[int] = (1, 2, 4, 8),
+ num_res_blocks: int,
+ attn_resolutions: Tuple[int],
+ pdrop: float = 0.0,
+ resamp_with_conv: bool = True,
+ in_channels: int,
+ resolution: int,
+ z_channels: int,
+ double_z: Optional[bool] = None) -> None:
+ super().__init__()
+ self.ch = ch
+ self.temb_ch = 0
+ self.num_resolutions = len(ch_mult)
+ self.num_res_blocks = num_res_blocks
+ self.resolution = resolution
+ self.in_channels = in_channels
+
+ # downsampling
+ self.conv_in = torch.nn.Conv2d(in_channels,
+ self.ch,
+ kernel_size=3,
+ stride=1,
+ padding=1)
+
+ curr_res = resolution
+ in_ch_mult = (1,)+tuple(ch_mult)
+ self.down = nn.ModuleList()
+ for i_level in range(self.num_resolutions):
+ block = nn.ModuleList()
+ attn = nn.ModuleList()
+ block_in = ch*in_ch_mult[i_level]
+ block_out = ch*ch_mult[i_level]
+ for i_block in range(self.num_res_blocks):
+ block.append(ResnetBlock(in_channels=block_in,
+ out_channels=block_out,
+ temb_channels=self.temb_ch,
+ dropout=pdrop))
+ block_in = block_out
+ if curr_res in attn_resolutions:
+ attn.append(AttnBlock(block_in))
+ down = nn.Module()
+ down.block = block
+ down.attn = attn
+ if i_level != self.num_resolutions-1:
+ down.downsample = Downsample(block_in, resamp_with_conv)
+ curr_res = curr_res // 2
+ self.down.append(down)
+
+ # middle
+ self.mid = nn.Module()
+ self.mid.block_1 = ResnetBlock(in_channels=block_in,
+ out_channels=block_in,
+ temb_channels=self.temb_ch,
+ dropout=pdrop)
+ self.mid.attn_1 = AttnBlock(block_in)
+ self.mid.block_2 = ResnetBlock(in_channels=block_in,
+ out_channels=block_in,
+ temb_channels=self.temb_ch,
+ dropout=pdrop)
+
+ # end
+ self.norm_out = Normalize(block_in)
+ self.conv_out = torch.nn.Conv2d(block_in,
+ 2*z_channels if double_z else z_channels,
+ kernel_size=3,
+ stride=1,
+ padding=1)
+
+ def forward(self, x):
+ assert x.shape[2] == x.shape[3] == self.resolution, \
+ "{}, {}".format(x.shape, self.resolution)
+
+ # downsampling
+ h = self.conv_in(x)
+ for i_level in range(self.num_resolutions):
+ for i_block in range(self.num_res_blocks):
+ h = self.down[i_level].block[i_block](h)
+ if len(self.down[i_level].attn) > 0:
+ h = self.down[i_level].attn[i_block](h)
+ if i_level != self.num_resolutions-1:
+ h = self.down[i_level].downsample(h)
+
+ # middle
+ h = self.mid.block_1(h)
+ h = self.mid.attn_1(h)
+ h = self.mid.block_2(h)
+
+ # end
+ h = self.norm_out(h)
+ h = nonlinearity(h)
+ h = self.conv_out(h)
+ return h
+
+
+class Decoder(nn.Module):
+ def __init__(self,
+ *, # forced to use named arguments
+ ch: int,
+ out_ch: int,
+ ch_mult: Tuple[int] = (1, 2, 4, 8),
+ num_res_blocks: int,
+ attn_resolutions: Tuple[int],
+ pdrop: float = 0.0,
+ resamp_with_conv: bool = True,
+ in_channels: int,
+ resolution: int,
+ z_channels: int,
+ double_z: bool) -> None:
+ super().__init__()
+ self.ch = ch
+ self.temb_ch = 0
+ self.num_resolutions = len(ch_mult)
+ self.num_res_blocks = num_res_blocks
+ self.resolution = resolution
+ self.in_channels = in_channels
+
+ # compute in_ch_mult, block_in and curr_res at lowest res
+ block_in = ch*ch_mult[self.num_resolutions-1]
+ curr_res = resolution // 2**(self.num_resolutions-1)
+ self.z_shape = (1, z_channels, curr_res, curr_res)
+
+ # z to block_in
+ self.conv_in = torch.nn.Conv2d(z_channels,
+ block_in,
+ kernel_size=3,
+ stride=1,
+ padding=1)
+
+ # middle
+ self.mid = nn.Module()
+ self.mid.block_1 = ResnetBlock(in_channels=block_in,
+ out_channels=block_in,
+ temb_channels=self.temb_ch,
+ dropout=pdrop)
+ self.mid.attn_1 = AttnBlock(block_in)
+ self.mid.block_2 = ResnetBlock(in_channels=block_in,
+ out_channels=block_in,
+ temb_channels=self.temb_ch,
+ dropout=pdrop)
+
+ # upsampling
+ self.up = nn.ModuleList()
+ for i_level in reversed(range(self.num_resolutions)):
+ block = nn.ModuleList()
+ attn = nn.ModuleList()
+ block_out = ch*ch_mult[i_level]
+ for i_block in range(self.num_res_blocks+1):
+ block.append(ResnetBlock(in_channels=block_in,
+ out_channels=block_out,
+ temb_channels=self.temb_ch,
+ dropout=pdrop))
+ block_in = block_out
+ if curr_res in attn_resolutions:
+ attn.append(AttnBlock(block_in))
+ up = nn.Module()
+ up.block = block
+ up.attn = attn
+ if i_level != 0:
+ up.upsample = Upsample(block_in, resamp_with_conv)
+ curr_res = curr_res * 2
+ self.up.insert(0, up) # prepend to get consistent order
+
+ # end
+ self.norm_out = Normalize(block_in)
+ self.conv_out = torch.nn.Conv2d(block_in,
+ out_ch,
+ kernel_size=3,
+ stride=1,
+ padding=1)
+
+ def forward(self, z):
+ assert z.shape[1:] == self.z_shape[1:]
+ self.last_z_shape = z.shape
+
+ # z to block_in
+ h = self.conv_in(z)
+
+ # middle
+ h = self.mid.block_1(h)
+ h = self.mid.attn_1(h)
+ h = self.mid.block_2(h)
+
+ # upsampling
+ for i_level in reversed(range(self.num_resolutions)):
+ for i_block in range(self.num_res_blocks+1):
+ h = self.up[i_level].block[i_block](h)
+ if len(self.up[i_level].attn) > 0:
+ h = self.up[i_level].attn[i_block](h)
+ if i_level != 0:
+ h = self.up[i_level].upsample(h)
+
+ h = self.norm_out(h)
+ h = nonlinearity(h)
+ h = self.conv_out(h)
+ return h
diff --git a/dalle/models/stage1/vqgan.py b/dalle/models/stage1/vqgan.py
new file mode 100644
index 0000000000000000000000000000000000000000..7f03a4d02aa579275d58290bc4f3714fd58bfe00
--- /dev/null
+++ b/dalle/models/stage1/vqgan.py
@@ -0,0 +1,93 @@
+# ------------------------------------------------------------------------------------
+# Modified from VQGAN (https://github.com/CompVis/taming-transformers)
+# Copyright (c) 2020 Patrick Esser and Robin Rombach and Björn Ommer. All Rights Reserved.
+# ------------------------------------------------------------------------------------
+
+import torch
+import torch.nn as nn
+from typing import List, Tuple, Optional
+from einops import rearrange
+from omegaconf import OmegaConf
+from .layers import Encoder, Decoder
+
+
+class VectorQuantizer(nn.Module):
+ """
+ Simplified VectorQuantizer in the original VQGAN repository
+ by removing unncessary modules for sampling
+ """
+ def __init__(self, dim: int, n_embed: int, beta: float) -> None:
+ super().__init__()
+ self.n_embed = n_embed
+ self.dim = dim
+ self.beta = beta
+
+ self.embedding = nn.Embedding(self.n_embed, self.dim)
+ self.embedding.weight.data.uniform_(-1.0 / self.n_embed, 1.0 / self.n_embed)
+
+ def forward(self,
+ z: torch.FloatTensor) -> Tuple[torch.FloatTensor, torch.LongTensor]:
+ z = rearrange(z, 'b c h w -> b h w c').contiguous() # [B,C,H,W] -> [B,H,W,C]
+ z_flattened = z.view(-1, self.dim)
+
+ d = torch.sum(z_flattened ** 2, dim=1, keepdim=True) + \
+ torch.sum(self.embedding.weight**2, dim=1) - 2 * \
+ torch.einsum('bd,dn->bn', z_flattened, rearrange(self.embedding.weight, 'n d -> d n'))
+
+ min_encoding_indices = torch.argmin(d, dim=1)
+ z_q = self.embedding(min_encoding_indices).view(z.shape)
+ return z_q, min_encoding_indices
+
+ def get_codebook_entry(self,
+ indices: torch.LongTensor,
+ shape: Optional[List[int]] = None) -> torch.FloatTensor:
+ z_q = self.embedding(indices)
+ if shape is not None:
+ z_q = z_q.view(shape)
+ z_q = z_q.permute(0, 3, 1, 2).contiguous()
+ return z_q
+
+
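+# Stage-1 VQGAN: get_codes() encodes an image into a latent_dim x latent_dim grid of discrete
+# codebook indices, and decode_code() maps such a grid back to pixels.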
+class VQGAN(nn.Module):
+ def __init__(self, n_embed: int, embed_dim: int, hparams: OmegaConf) -> None:
+ super().__init__()
+ self.encoder = Encoder(**hparams)
+ self.decoder = Decoder(**hparams)
+ self.quantize = VectorQuantizer(dim=embed_dim, n_embed=n_embed, beta=0.25)
+ self.quant_conv = torch.nn.Conv2d(hparams.z_channels, embed_dim, 1)
+ self.post_quant_conv = torch.nn.Conv2d(embed_dim, hparams.z_channels, 1)
+ self.latent_dim = hparams.attn_resolutions[0]
+
+ def forward(self, x: torch.FloatTensor) -> torch.FloatTensor:
+ quant = self.encode(x)
+ dec = self.decode(quant)
+ return dec
+
+ def encode(self, x: torch.FloatTensor) -> torch.FloatTensor:
+ h = self.encoder(x)
+ h = self.quant_conv(h)
+ quant = self.quantize(h)[0]
+ quant = rearrange(quant, 'b h w c -> b c h w').contiguous()
+ return quant
+
+ def decode(self, quant: torch.FloatTensor) -> torch.FloatTensor:
+ quant = self.post_quant_conv(quant)
+ dec = self.decoder(quant)
+ return dec
+
+ def decode_code(self, code: torch.LongTensor) -> torch.FloatTensor:
+ quant = self.quantize.get_codebook_entry(code)
+ quant = quant.permute(0, 3, 1, 2)
+ dec = self.decode(quant)
+ return dec
+
+ def get_codes(self, x: torch.FloatTensor) -> torch.LongTensor:
+ h = self.encoder(x)
+ h = self.quant_conv(h)
+ codes = self.quantize(h)[1].view(x.shape[0], self.latent_dim ** 2)
+ return codes
+
+ def from_ckpt(self, path: str, strict: bool = True) -> None:
+ ckpt = torch.load(path, map_location='cpu')['state_dict']
+ self.load_state_dict(ckpt, strict=strict)
+ print(f'{path} successfully restored..')
diff --git a/dalle/models/stage2/layers.py b/dalle/models/stage2/layers.py
new file mode 100644
index 0000000000000000000000000000000000000000..d0a60c297cbf94a0c7ac1946ac70f6e862f912e6
--- /dev/null
+++ b/dalle/models/stage2/layers.py
@@ -0,0 +1,216 @@
+# ------------------------------------------------------------------------------------
+# Minimal DALL-E
+# Copyright (c) 2021 KakaoBrain. All Rights Reserved.
+# Licensed under the Apache License, Version 2.0 [see LICENSE for details]
+# ------------------------------------------------------------------------------------
+# Modified from minGPT (https://github.com/karpathy/minGPT)
+# Copyright (c) 2020 Andrej Karpathy. All Rights Reserved.
+# ------------------------------------------------------------------------------------
+
+import math
+import torch
+import torch.nn as nn
+from torch.nn import functional as F
+
+
+class GELU(nn.Module):
+ def __init__(self, use_approx=False):
+ super().__init__()
+ self.use_approx = use_approx
+
+ def forward(self, x):
+ if self.use_approx:
+ return x * torch.sigmoid(1.702 * x)
+ else:
+ return F.gelu(x)
+
+
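+# Multi-head self-attention operating in (T, B, C) layout with the heads folded into the batch
+# dimension; supports cached key/value pairs (layer_past / use_cache) for fast autoregressive sampling.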
+class MultiHeadSelfAttention(nn.Module):
+
+ def __init__(self,
+ ctx_len: int,
+ embed_dim: int,
+ n_heads: int,
+ resid_pdrop: float,
+ attn_pdrop: float,
+ attn_bias: bool,
+ use_mask: bool = True):
+ super().__init__()
+ assert embed_dim % n_heads == 0
+
+ # key, query, value projections for all heads
+ self.key = nn.Linear(embed_dim, embed_dim, bias=attn_bias)
+ self.query = nn.Linear(embed_dim, embed_dim, bias=attn_bias)
+ self.value = nn.Linear(embed_dim, embed_dim, bias=attn_bias)
+
+ # regularization
+ self.attn_drop = nn.Dropout(attn_pdrop)
+ self.resid_drop = nn.Dropout(resid_pdrop)
+
+ # output projection
+ self.proj = nn.Linear(embed_dim, embed_dim, attn_bias)
+
+ self.n_heads = n_heads
+ self.ctx_len = ctx_len
+ self.use_mask = use_mask
+ if self.use_mask:
+ self.register_buffer("mask", torch.ones(ctx_len, ctx_len), persistent=False)
+ self.mask = torch.tril(self.mask).view(1, ctx_len, ctx_len)
+
+ def forward(self, x, use_cache=False, layer_past=None):
+ B, T, C = x.shape
+ x = x.transpose(0, 1).contiguous() # (B, T, C) -> (T, B, C)
+
+ # calculate query, key, values for all heads in batch and move head forward to be the batch dim
+ k = self.key(x).view(T, B*self.n_heads, C//self.n_heads).transpose(0, 1) # (B*nh, T, hs)
+ q = self.query(x).view(T, B*self.n_heads, C//self.n_heads).transpose(0, 1) # (B*nh, T, hs)
+ v = self.value(x).view(T, B*self.n_heads, C//self.n_heads).transpose(0, 1) # (B*nh, T, hs)
+
+ if use_cache:
+ present = torch.stack([k, v])
+
+ if layer_past is not None:
+ # print(layer_past.shape, k.shape, v.shape, q.shape)
+ # print("LayerPast shape", layer_past.shape)
+ past_key, past_value = layer_past
+
+ if len(past_key.shape) == 4:
+ _, _, seq_len, dim = past_key.shape
+ k = torch.cat([past_key.reshape(-1, seq_len, dim), k], dim=-2)
+ v = torch.cat([past_value.reshape(-1, seq_len, dim), v], dim=-2)
+ elif len(past_key.shape) == 3:
+ past_key, past_value = layer_past
+ k = torch.cat([past_key, k], dim=-2)
+ v = torch.cat([past_value, v], dim=-2)
+ else:
+ raise ValueError
+
+ if use_cache and layer_past is not None:
+ # Tensor shape below: (B * nh, 1, hs) X (B * nh, hs, K) -> (B * nh, 1, K)
+ att = torch.bmm(q, (k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1))))
+ att = F.softmax(att, dim=-1)
+ att = self.attn_drop(att)
+ y = torch.bmm(att, v) # (B*nh, 1, K) X (B*nh, K, hs) -> (B*nh, 1, hs)
+ else:
+ # Tensor shape below: (B * nh, T, hs) X (B * nh, hs, T) -> (B * nh, T, T)
+ att = torch.bmm(q, (k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1))))
+ if self.use_mask:
+ # TODO: flip when not prompt tuning
+ # mask = self.mask if T == self.ctx_len else self.mask[:, :T, :T]
+ if T == self.ctx_len:
+ mask = self.mask
+ else:
+ mask = torch.tril(torch.ones(T, T)).view(1, T, T).to(att.device)
+ att = att.masked_fill(mask == 0, float('-inf'))
+ att = F.softmax(att, dim=-1)
+ att = self.attn_drop(att)
+ y = torch.bmm(att, v) # (B*nh, T, T) X (B*nh, T, hs) -> (B*nh, T, hs)
+ y = y.transpose(0, 1).contiguous().view(T, B, C) # re-assemble all head outputs side by side
+
+ # output projection
+ y = self.resid_drop(self.proj(y))
+ if use_cache:
+ return y.transpose(0, 1).contiguous(), present # (T, B, C) -> (B, T, C)
+ else:
+ return y.transpose(0, 1).contiguous() # (T, B, C) -> (B, T, C)
+
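+ # Cross-attention: queries come from x, keys/values from `context` (e.g. source-frame code
+ # embeddings); `mask` zeroes the attended output at positions that should not receive context.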
+ def forward_with_context(self, x, context, mask=None):
+ B, T, C = x.shape
+ x = x.transpose(0, 1).contiguous() # (B, T, C) -> (T, B, C)
+
+ # calculate query, key, values for all heads in batch and move head forward to be the batch dim
+ q = self.query(x).view(T, B*self.n_heads, C//self.n_heads).transpose(0, 1) # (B*nh, T, hs)
+
+ B, T_c, C = context.shape
+ k = self.key(context).view(T_c, B * self.n_heads, C // self.n_heads).transpose(0, 1) # (B*nh, T, hs)
+ v = self.value(context).view(T_c, B*self.n_heads, C//self.n_heads).transpose(0, 1) # (B*nh, T, hs)
+
+ # Tensor shape below: (B * nh, T, hs) X (B * nh, hs, Tc) -> (B * nh, T, Tc)
+ att = torch.bmm(q, (k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1))))
+ att = F.softmax(att, dim=-1)
+ att = self.attn_drop(att)
+ y = torch.bmm(att, v) # (B*nh, T, T) X (B*nh, T, hs) -> (B*nh, T, hs)
+ y = y.transpose(0, 1).contiguous().view(T, B, C) # re-assemble all head outputs side by side
+
+ # output projection
+ y = self.resid_drop(self.proj(y)).transpose(0, 1).contiguous()
+ if mask is not None:
+ y = y.masked_fill(mask == 0, float('0.0'))
+ return y # (T, B, C) -> (B, T, C)
+
+
+class Block(nn.Module):
+
+ def __init__(self,
+ ctx_len: int,
+ embed_dim: int,
+ n_heads: int,
+ mlp_bias: bool,
+ attn_bias: bool,
+ resid_pdrop: bool,
+ attn_pdrop: bool,
+ gelu_use_approx: bool):
+ super().__init__()
+ self.ln1 = nn.LayerNorm(embed_dim)
+ self.ln2 = nn.LayerNorm(embed_dim)
+
+ self.attn = MultiHeadSelfAttention(ctx_len=ctx_len,
+ embed_dim=embed_dim,
+ n_heads=n_heads,
+ attn_pdrop=attn_pdrop,
+ resid_pdrop=resid_pdrop,
+ attn_bias=attn_bias,
+ use_mask=True)
+ self.mlp = nn.Sequential(
+ nn.Linear(embed_dim, 4 * embed_dim, bias=mlp_bias),
+ GELU(gelu_use_approx),
+ nn.Linear(4 * embed_dim, embed_dim, bias=mlp_bias),
+ nn.Dropout(resid_pdrop),
+ )
+
+ def forward(self, x, layer_past=None):
+ x = x + self.attn(self.ln1(x), layer_past=layer_past)
+ x = x + self.mlp(self.ln2(x))
+ return x
+
+ def sample(self, x, layer_past=None):
+ attn, present = self.attn(self.ln1(x), use_cache=True, layer_past=layer_past)
+ x = x + attn
+ x = x + self.mlp(self.ln2(x))
+ return x, present
+
+ def sample_with_context(self, x, context, context_mask, cross_attn_layer, layer_past=None):
+ attn, present = self.attn(self.ln1(x), use_cache=True, layer_past=layer_past)
+ x = x + attn
+ c_attn = cross_attn_layer(x, context, context_mask)
+ x = x + c_attn
+ x = x + self.mlp(self.ln2(x))
+ return x, present
+
+
+class CrossAttentionLayer(nn.Module):
+
+ def __init__(self,
+ ctx_len: int,
+ embed_dim: int,
+ n_heads: int,
+ attn_bias: bool,
+ resid_pdrop: bool,
+ attn_pdrop: bool):
+ super().__init__()
+
+ self.ln1 = nn.LayerNorm(embed_dim)
+ self.ln2 = nn.LayerNorm(embed_dim)
+ self.attn = MultiHeadSelfAttention(ctx_len=ctx_len,
+ embed_dim=embed_dim,
+ n_heads=n_heads,
+ attn_pdrop=attn_pdrop,
+ resid_pdrop=resid_pdrop,
+ attn_bias=attn_bias,
+ use_mask=False)
+
+ def forward(self, x, context, context_mask=None):
+ attn = self.attn.forward_with_context(self.ln1(x), self.ln2(context), context_mask)
+ # x = x + attn
+ # return x
+ return attn
\ No newline at end of file
diff --git a/dalle/models/stage2/transformer.py b/dalle/models/stage2/transformer.py
new file mode 100644
index 0000000000000000000000000000000000000000..fc74a2992813d65d364b5562e8912398af61135e
--- /dev/null
+++ b/dalle/models/stage2/transformer.py
@@ -0,0 +1,502 @@
+# ------------------------------------------------------------------------------------
+# Minimal DALL-E
+# Copyright (c) 2021 KakaoBrain. All Rights Reserved.
+# Licensed under the Apache License, Version 2.0 [see LICENSE for details]
+# ------------------------------------------------------------------------------------
+# Modified from minGPT (https://github.com/karpathy/minGPT)
+# Copyright (c) 2020 Andrej Karpathy. All Rights Reserved.
+# ------------------------------------------------------------------------------------
+
+import torch
+import torch.nn as nn
+from typing import Optional, Tuple, List
+from torch.cuda.amp import autocast
+from omegaconf import OmegaConf
+from .layers import Block
+
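+# Stage-2 autoregressive transformer over [prompt | text tokens | image tokens]; when cross-attention
+# layers are supplied, they are interleaved at every third block to attend to the source-frame codes.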
+class Transformer1d(nn.Module):
+
+ def __init__(self,
+ vocab_size_txt: int,
+ vocab_size_img: int,
+ hparams: OmegaConf) -> None:
+ super().__init__()
+ assert hparams.n_layers == hparams.n_dense_layers
+
+ # input embedding for image and text
+ self.tok_emb_img = nn.Embedding(vocab_size_img, hparams.embed_dim)
+ self.tok_emb_txt = nn.Embedding(vocab_size_txt, hparams.embed_dim)
+
+ self.pos_emb_img = nn.Embedding(hparams.ctx_len_img, hparams.embed_dim)
+ self.pos_emb_txt = nn.Embedding(hparams.ctx_len_txt, hparams.embed_dim)
+
+ self.drop = nn.Dropout(hparams.embd_pdrop)
+
+ # transformer blocks
+ self.blocks = [Block(ctx_len=hparams.ctx_len_img + hparams.ctx_len_txt,
+ embed_dim=hparams.embed_dim,
+ n_heads=hparams.n_heads,
+ mlp_bias=hparams.mlp_bias,
+ attn_bias=hparams.attn_bias,
+ resid_pdrop=hparams.resid_pdrop,
+ attn_pdrop=hparams.attn_pdrop,
+ gelu_use_approx=hparams.gelu_use_approx) for i in range(1, hparams.n_layers+1)]
+ self.blocks = nn.Sequential(*self.blocks)
+
+ # heads for image and text
+ self.ln_f = nn.LayerNorm(hparams.embed_dim)
+ self.head_img = nn.Linear(hparams.embed_dim, vocab_size_img, bias=False)
+ self.head_txt = nn.Linear(hparams.embed_dim, vocab_size_txt, bias=False)
+
+ self.ctx_len_img = hparams.ctx_len_img
+ self.ctx_len_txt = hparams.ctx_len_txt
+ self.n_layers = hparams.n_layers
+
+ self.apply(self._init_weights)
+
+
+ def _init_weights(self, module: nn.Module) -> None:
+ if isinstance(module, (nn.Linear, nn.Embedding)):
+ module.weight.data.normal_(mean=0.0, std=0.02)
+ if isinstance(module, nn.Linear) and module.bias is not None:
+ module.bias.data.zero_()
+ elif isinstance(module, nn.LayerNorm):
+ module.bias.data.zero_()
+ module.weight.data.fill_(1.0)
+
+
+ def resize_token_embeddings(self, new_num_tokens):
+
+ old_num_tokens, old_embedding_dim = self.tok_emb_txt.weight.size()
+ new_embeddings = nn.Embedding(new_num_tokens, old_embedding_dim)
+ new_embeddings.to(self.tok_emb_txt.weight.device, dtype=self.tok_emb_txt.weight.dtype)
+ self._init_weights(new_embeddings)
+ # numbers of tokens to copy
+ n = min(old_num_tokens, new_num_tokens)
+ new_embeddings.weight.data[:n, :] = self.tok_emb_txt.weight.data[:n, :]
+ self.tok_emb_txt = new_embeddings
+
+ self.resize_lm_head(new_num_tokens)
+ # TODO: also change config to reflect new vocab size
+
+ return new_embeddings
+
+
+ def resize_lm_head(
+ self, new_num_tokens: Optional[int] = None, transposed: Optional[bool] = False) -> nn.Linear:
+
+ old_num_tokens, old_lm_head_dim = (
+ self.head_txt.weight.size() if not transposed else self.head_txt.weight.t().size()
+ )
+ # Build new lm head
+ new_lm_head_shape = (old_lm_head_dim, new_num_tokens) if not transposed else (new_num_tokens, old_lm_head_dim)
+ has_new_lm_head_bias = self.head_txt.bias is not None
+ new_lm_head = nn.Linear(*new_lm_head_shape, bias=has_new_lm_head_bias)
+ new_lm_head = new_lm_head.to(self.head_txt.weight.device, dtype=self.head_txt.weight.dtype)
+
+ # initialize new lm head (in particular added tokens)
+ self._init_weights(new_lm_head)
+ num_tokens_to_copy = min(old_num_tokens, new_num_tokens)
+ # Copy old lm head weights to new lm head
+ if not transposed:
+ new_lm_head.weight.data[:num_tokens_to_copy, :] = self.head_txt.weight.data[:num_tokens_to_copy, :]
+ else:
+ new_lm_head.weight.data[:, :num_tokens_to_copy] = self.head_txt.weight.data[:, :num_tokens_to_copy]
+
+ # Copy bias weights to new lm head
+ if has_new_lm_head_bias:
+ new_lm_head.bias.data[:num_tokens_to_copy] = self.head_txt.bias.data[:num_tokens_to_copy]
+
+ self.head_txt = new_lm_head
+
+ return new_lm_head
+
+
+ def forward(self,
+ images: torch.LongTensor,
+ texts: torch.LongTensor,
+ pos_images: torch.LongTensor,
+ pos_texts: torch.LongTensor,
+ past: Optional[List[torch.Tensor]] = None,
+ prompt: Optional[List[torch.Tensor]] = None,
+ pos_prompt: Optional[List[torch.Tensor]] = None) -> Tuple[torch.FloatTensor, torch.FloatTensor]:
+
+
+ B, T = images.shape
+ _, N = texts.shape
+
+ assert T <= self.ctx_len_img, "Already reached the maximum context length (image)."
+ assert N == self.ctx_len_txt, "Already reached the maximum context length (text)."
+
+ texts = self.tok_emb_txt(texts)
+ images = self.tok_emb_img(images)
+
+ texts = texts + self.pos_emb_txt(pos_texts)
+ images = images + self.pos_emb_img(pos_images)
+
+ if prompt is not None:
+ prompt = prompt + self.pos_emb_txt(pos_prompt)
+ texts = torch.cat([prompt, texts], dim=1).contiguous()
+ P = prompt.shape[1]
+
+ x = torch.cat([texts, images], dim=1).contiguous()
+ x = self.drop(x)
+
+ # x = self.blocks(x)
+ for i, block in enumerate(self.blocks):
+ x, _ = block.sample(x, layer_past=None if past is None else past[i])
+
+ x = self.ln_f(x)
+
+ if prompt is not None:
+ texts = x[:, P:N+P-1].contiguous()
+ images = x[:, N+P-1:-1].contiguous()
+ else:
+ texts = x[:, :N-1].contiguous()
+ images = x[:, N-1:-1].contiguous()
+
+ logits_txt = self.head_txt(texts)
+ logits_img = self.head_img(images)
+ return logits_img, logits_txt
+
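+ # Same as forward(), but also embeds the source-frame codes and routes them through the
+ # interleaved cross-attention layers; the mask limits the cross-attended output to the
+ # positions that predict image tokens.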
+ def forward_with_context(self,
+ images: torch.LongTensor,
+ texts: torch.LongTensor,
+ pos_images: torch.LongTensor,
+ pos_texts: torch.LongTensor,
+ src_images: torch.LongTensor,
+ src_pos_images: torch.LongTensor,
+ cross_attention_idxs: List,
+ cross_attention_layers,
+ past: Optional[List[torch.Tensor]] = None,
+ prompt: Optional[List[torch.Tensor]] = None,
+ pos_prompt: Optional[List[torch.Tensor]] = None) -> Tuple[torch.FloatTensor, torch.FloatTensor]:
+
+
+ B, T = images.shape
+ _, N = texts.shape
+
+ assert T <= self.ctx_len_img, "Already reached the maximum context length (image)."
+ assert N == self.ctx_len_txt, "Already reached the maximum context length (text)."
+
+ texts = self.tok_emb_txt(texts)
+ images = self.tok_emb_img(images)
+ src_images = self.tok_emb_img(src_images)
+
+ texts = texts + self.pos_emb_txt(pos_texts)
+ images = images + self.pos_emb_img(pos_images)
+ src_images = src_images + self.pos_emb_img(src_pos_images)
+
+ if prompt is not None:
+ prompt = prompt + self.pos_emb_txt(pos_prompt)
+ texts = torch.cat([prompt, texts], dim=1).contiguous()
+ P = prompt.shape[1]
+ else:
+ P = 0
+
+ x = torch.cat([texts, images], axis=1).contiguous()
+ x = self.drop(x)
+
+ # prepare mask
+ mask = torch.zeros_like(x[0])
+ mask[self.ctx_len_txt+P-1:, :].fill_(1.0)
+ mask = mask.unsqueeze(0)
+
+ # print(images.shape, texts.shape, src_images.shape, mask.shape, x.shape)
+
+ # x = self.blocks(x)
+ for i, block in enumerate(self.blocks):
+ if i in cross_attention_idxs:
+ x, _ = block.sample_with_context(x, src_images, mask, cross_attention_layers[int(((i+1)/3)-1)], layer_past=None if past is None else past[i])
+ else:
+ x, _ = block.sample(x, layer_past=None if past is None else past[i])
+
+ x = self.ln_f(x)
+
+ if prompt is not None:
+ texts = x[:, P:N+P-1].contiguous()
+ images = x[:, N+P-1:-1].contiguous()
+ else:
+ texts = x[:, :N-1].contiguous()
+ images = x[:, N-1:-1].contiguous()
+
+ logits_txt = self.head_txt(texts)
+ logits_img = self.head_img(images)
+ return logits_img, logits_txt
+
+ @torch.no_grad()
+ def sampling(self,
+ images: torch.LongTensor,
+ texts: torch.LongTensor,
+ pos_images: torch.LongTensor,
+ pos_texts: torch.LongTensor,
+ use_fp16: bool = True,
+ past: Optional[List[torch.Tensor]] = None,
+ prompt: Optional[List[torch.Tensor]] = None,
+ pos_prompt: Optional[List[torch.Tensor]] = None) -> Tuple[torch.FloatTensor, List[torch.FloatTensor]]:
+
+ _, N = texts.shape
+ assert N == self.ctx_len_txt, "Already reached the maximum context length (text)."
+
+ with autocast(enabled=use_fp16):
+ if images is None:
+ # assert past is None
+
+ texts = self.tok_emb_txt(texts)
+ x = texts + self.pos_emb_txt(pos_texts)
+
+ if prompt is not None:
+ prompt = prompt + self.pos_emb_txt(pos_prompt)
+ texts = torch.cat([prompt, texts], dim=1).contiguous()
+
+ x = self.drop(x)
+
+ if past is not None:
+ past = torch.cat(past, dim=-2)
+
+ presents = []
+ for i, block in enumerate(self.blocks):
+ x, present = block.sample(x, layer_past=None if past is None else past[i])
+ presents.append(present)
+ x = self.ln_f(x)
+ x = x[:, N-1].contiguous()
+ logits = self.head_img(x)
+ else:
+ if past is None:
+ texts = self.tok_emb_txt(texts)
+ images = self.tok_emb_img(images)
+ texts = texts + self.pos_emb_txt(pos_texts)
+ images = images + self.pos_emb_img(pos_images)
+
+ if prompt is not None:
+ prompt = prompt + self.pos_emb_txt(pos_prompt)
+ texts = torch.cat([prompt, texts], dim=1).contiguous()
+
+ x = torch.cat([texts, images], axis=1).contiguous()
+ else:
+ images = self.tok_emb_img(images)
+ x = images + self.pos_emb_img(pos_images)
+ x = self.drop(x)
+
+ # if past is not None and len(past) > 1:
+ if past is not None:
+ past = torch.cat(past, dim=-2)
+ # print('Past', past.shape)
+ presents = []
+ # print(len(past), past[0].shape)
+ for i, block in enumerate(self.blocks):
+ x, present = block.sample(x, layer_past=None if past is None else past[i])
+ presents.append(present)
+ x = self.ln_f(x)
+ x = x[:, -1].contiguous()
+ logits = self.head_img(x)
+ return logits, presents
+
+ @torch.no_grad()
+ def sampling_with_context(self,
+ images: torch.LongTensor,
+ cross_attention_idxs,
+ cross_attention_layers,
+ texts: torch.LongTensor,
+ pos_images: torch.LongTensor,
+ pos_texts: torch.LongTensor,
+ source_image: torch.LongTensor,
+ use_fp16: bool = True,
+ past: Optional[List[torch.Tensor]] = None,
+ prompt: Optional[List[torch.Tensor]] = None,
+ pos_prompt: Optional[List[torch.Tensor]] = None
+ ) -> Tuple[torch.FloatTensor, List[torch.FloatTensor]]:
+
+ _, N = texts.shape
+ assert N == self.ctx_len_txt, "Already reached the maximum context length (text)."
+
+ if prompt is not None:
+ P = prompt.shape[1]
+ else:
+ P = 0
+
+ with autocast(enabled=use_fp16):
+ if images is None:
+ # assert past is None
+
+ texts = self.tok_emb_txt(texts)
+ texts = texts + self.pos_emb_txt(pos_texts)
+
+ if prompt is not None:
+ prompt = prompt + self.pos_emb_txt(pos_prompt)
+ texts = torch.cat([prompt, texts], dim=1).contiguous()
+
+ x = self.drop(texts)
+
+ if past is not None:
+ past = torch.cat(past, dim=-2)
+
+ # prepare mask
+ mask = torch.zeros_like(x[0])
+ mask[self.ctx_len_txt+P - 1:, :].fill_(1.0)
+ mask = mask.unsqueeze(0)
+
+ presents = []
+ for i, block in enumerate(self.blocks):
+ if i in cross_attention_idxs:
+ x, present = block.sample_with_context(x, source_image, mask,
+ cross_attention_layers[int(((i + 1) / 3) - 1)],
+ layer_past=None if past is None else past[i])
+ else:
+ x, present = block.sample(x, layer_past=None if past is None else past[i])
+ presents.append(present)
+ x = self.ln_f(x)
+ x = x[:, N-1].contiguous()
+ logits = self.head_img(x)
+ else:
+ if past is None:
+ texts = self.tok_emb_txt(texts)
+ images = self.tok_emb_img(images)
+ texts = texts + self.pos_emb_txt(pos_texts)
+ images = images + self.pos_emb_img(pos_images)
+
+ if prompt is not None:
+ prompt = prompt + self.pos_emb_txt(pos_prompt)
+ texts = torch.cat([prompt, texts], dim=1).contiguous()
+
+ x = torch.cat([texts, images], axis=1).contiguous()
+ else:
+ images = self.tok_emb_img(images)
+ x = images + self.pos_emb_img(pos_images)
+ x = self.drop(x)
+
+ # if past is not None and len(past) > 1:
+ if past is not None:
+ past = torch.cat(past, dim=-2)
+ presents = []
+
+ # prepare mask
+ mask = torch.zeros_like(x[0])
+ mask[self.ctx_len_txt+P - 1:, :].fill_(1.0)
+ mask = mask.unsqueeze(0)
+
+ # print(len(past), past[0].shape)
+ for i, block in enumerate(self.blocks):
+ if i in cross_attention_idxs:
+ x, present = block.sample_with_context(x, source_image, mask,
+ cross_attention_layers[int(((i + 1) / 3) - 1)],
+ layer_past=None if past is None else past[i])
+ else:
+ x, present = block.sample(x, layer_past=None if past is None else past[i])
+ presents.append(present)
+ x = self.ln_f(x)
+ x = x[:, -1].contiguous()
+ logits = self.head_img(x)
+ return logits, presents
+
+ def from_ckpt(self, path: str) -> None:
+ ckpt = torch.load(path, map_location='cpu')['state_dict']
+ self.load_state_dict(ckpt, strict=True)
+ print(f'{path} successfully restored..')
+
+
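+# Image-only GPT over VQGAN codes; the start-of-sequence embedding is either a learned parameter
+# or a class embedding when use_cls_cond is set.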
+class iGPT(nn.Module):
+ def __init__(self,
+ vocab_size_img: int,
+ use_cls_cond: bool,
+ hparams: OmegaConf) -> None:
+ super().__init__()
+ self.use_cls_cond = use_cls_cond
+
+ # sos token embedding
+ if self.use_cls_cond:
+ self.sos = nn.Embedding(hparams.n_classes, hparams.embed_dim)
+ else:
+ self.sos = nn.Parameter(torch.randn(1, 1, hparams.embed_dim))
+
+ # input embedding
+ self.tok_emb_img = nn.Embedding(vocab_size_img, hparams.embed_dim)
+ self.pos_emb_img = nn.Embedding(hparams.ctx_len_img, hparams.embed_dim)
+
+ self.drop = nn.Dropout(hparams.embd_pdrop)
+
+ # transformer blocks
+ self.blocks = [Block(ctx_len=hparams.ctx_len_img + 1,
+ embed_dim=hparams.embed_dim,
+ n_heads=hparams.n_heads,
+ mlp_bias=hparams.mlp_bias,
+ attn_bias=hparams.attn_bias,
+ resid_pdrop=hparams.resid_pdrop,
+ attn_pdrop=hparams.attn_pdrop,
+ gelu_use_approx=hparams.gelu_use_approx) for i in range(1, hparams.n_layers+1)]
+ self.blocks = nn.Sequential(*self.blocks)
+
+ # head
+ self.ln_f = nn.LayerNorm(hparams.embed_dim)
+ self.head = nn.Linear(hparams.embed_dim, vocab_size_img, bias=False)
+
+ self.ctx_len_img = hparams.ctx_len_img
+ self.n_layers = hparams.n_layers
+
+ self.apply(self._init_weights)
+
+ def _init_weights(self, module: nn.Module) -> None:
+ if isinstance(module, (nn.Linear, nn.Embedding)):
+ module.weight.data.normal_(mean=0.0, std=0.02)
+ if isinstance(module, nn.Linear) and module.bias is not None:
+ module.bias.data.zero_()
+ elif isinstance(module, nn.LayerNorm):
+ module.bias.data.zero_()
+ module.weight.data.fill_(1.0)
+
+ @torch.no_grad()
+ def sampling(self,
+ sos: torch.FloatTensor,
+ codes: torch.LongTensor,
+ pos_codes: torch.LongTensor,
+ n_samples: int = 16,
+ use_fp16: bool = True,
+ past: Optional[torch.Tensor] = None) -> Tuple[torch.FloatTensor, List[torch.FloatTensor]]:
+ with autocast(enabled=use_fp16):
+ if codes is None:
+ assert past is None
+ xs = self.drop(sos)
+ presents = []
+ for i, block in enumerate(self.blocks):
+ xs, present = block.sample(xs, layer_past=None)
+ presents.append(present)
+ xs = self.ln_f(xs)
+ logits = self.head(xs)[:, -1]
+ else:
+ if past is None:
+ xs = self.tok_emb_img(codes) + self.pos_emb_img(pos_codes)
+ xs = torch.cat([sos, xs], dim=1)
+ else:
+ xs = self.tok_emb_img(codes) + self.pos_emb_img(pos_codes)
+ xs = self.drop(xs)
+
+ past = torch.cat(past, dim=-2) if past is not None else past
+ presents = []
+ for i, block in enumerate(self.blocks):
+ xs, present = block.sample(xs, layer_past=None if past is None else past[i])
+ presents.append(present)
+
+ xs = self.ln_f(xs)
+ logits = self.head(xs)[:, -1]
+ return logits, presents
+
+ def forward(self,
+ codes: torch.LongTensor,
+ labels: Optional[torch.LongTensor] = None) -> torch.FloatTensor:
+ B, T = codes.shape
+ xps = torch.arange(T, device=codes.device).repeat((B, 1))
+ sos = self.sos.repeat((B, 1, 1)) if labels is None else self.sos(labels).unsqueeze(1)
+
+ h = self.tok_emb_img(codes) + self.pos_emb_img(xps)
+ h = torch.cat([sos, h[:, :-1]], dim=1).contiguous()
+
+ h = self.drop(h)
+ h = self.blocks(h)
+ h = self.ln_f(h)
+ logits = self.head(h)
+ return logits
+
+ def from_ckpt(self, path: str, strict: bool = True) -> None:
+ ckpt = torch.load(path, map_location='cpu')['state_dict']
+ self.load_state_dict(ckpt, strict=strict)
+ print(f'{path} successfully restored..')
diff --git a/dalle/models/tokenizer.py b/dalle/models/tokenizer.py
new file mode 100644
index 0000000000000000000000000000000000000000..1187abc02d364d414b86cddf2f77180ece688197
--- /dev/null
+++ b/dalle/models/tokenizer.py
@@ -0,0 +1,35 @@
+# ------------------------------------------------------------------------------------
+# Minimal DALL-E
+# Copyright (c) 2021 KakaoBrain. All Rights Reserved.
+# Licensed under the Apache License, Version 2.0 [see LICENSE for details]
+# ------------------------------------------------------------------------------------
+
+import os
+from functools import partial
+from tokenizers import CharBPETokenizer
+
+
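+# Builds the CharBPE tokenizer from a checkpoint directory, trying the bpe-16k vocab/merges files
+# first and falling back to the generic vocab.json / merges.txt names; padding and truncation are
+# fixed to context_length.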
+def build_tokenizer(path: str,
+ context_length: int = 64,
+ *args,
+ **kwargs):
+ try:
+ from_file = partial(CharBPETokenizer.from_file,
+ vocab_filename=os.path.join(path, 'bpe-16k-vocab.json'),
+ merges_filename=os.path.join(path, 'bpe-16k-merges.txt'),
+ unk_token='[UNK]')
+ tokenizer = from_file(*args, **kwargs)
+ except Exception:
+ # fall back to the default vocab/merges filenames
+ from_file = partial(CharBPETokenizer.from_file,
+ vocab_filename=os.path.join(path, 'vocab.json'),
+ merges_filename=os.path.join(path, 'merges.txt'),
+ unk_token='[UNK]')
+ tokenizer = from_file(*args, **kwargs)
+
+ # tokenizer = from_file(*args, **kwargs)
+ tokenizer.add_special_tokens(['[PAD]'])
+ tokenizer.enable_padding(length=context_length,
+ pad_id=tokenizer.token_to_id('[PAD]'))
+ tokenizer.enable_truncation(max_length=context_length)
+ print(f'{path} successfully restored..')
+ return tokenizer
diff --git a/dalle/trainer_prefix.py b/dalle/trainer_prefix.py
new file mode 100644
index 0000000000000000000000000000000000000000..77e216d07bfe191c84b917db0bd4e02e593972e0
--- /dev/null
+++ b/dalle/trainer_prefix.py
@@ -0,0 +1,1629 @@
+import inspect
+import json
+import math
+import os
+import re
+import shutil
+import warnings
+from contextlib import contextmanager
+from pathlib import Path
+from typing import Any, Callable, Dict, List, Optional, Tuple, Union
+
+from nltk import word_tokenize
+import numpy as np
+import torch
+from packaging import version
+from torch import nn
+from torch.utils.data.dataloader import DataLoader
+from torch.utils.data.dataset import Dataset
+from torch.utils.data.distributed import DistributedSampler
+from torch.utils.data.sampler import RandomSampler, Sampler, SequentialSampler
+from tqdm.auto import tqdm, trange
+from torch.nn.utils.rnn import pad_sequence
+import random
+
+from transformers.data.data_collator import DataCollator, DataCollatorWithPadding, default_data_collator
+from transformers.file_utils import is_datasets_available, is_torch_tpu_available
+from transformers.integrations import (
+ default_hp_search_backend,
+ is_comet_available,
+ is_optuna_available,
+ is_ray_available,
+ is_tensorboard_available,
+ is_wandb_available,
+ run_hp_search_optuna,
+ run_hp_search_ray,
+)
+
+from transformers.modeling_utils import PreTrainedModel
+from transformers.optimization import AdamW, get_linear_schedule_with_warmup, get_constant_schedule_with_warmup
+from transformers.tokenization_utils_base import PreTrainedTokenizerBase
+from transformers.trainer_utils import (
+ PREFIX_CHECKPOINT_DIR,
+ BestRun,
+ EvalPrediction,
+ EvaluationStrategy,
+ HPSearchBackend,
+ PredictionOutput,
+ TrainOutput,
+ default_compute_objective,
+ default_hp_space,
+ set_seed,
+)
+from transformers.training_args import TrainingArguments
+from transformers.utils import logging
+
+
+_use_native_amp = False
+_use_apex = False
+EPS = 1e-12
+INIT_GUMBEL_TEMP = 5.0
+
+control_lst = ['positive', 'negative', 'neutral']
+Control_Temp = {'positive': 3967, 'negative':4633, 'neutral':8500}
+control_Map = [torch.LongTensor([3967]), torch.LongTensor([4633]), torch.LongTensor([8500])]
+sst_lst = [(0, 2), (1, 3), (4,)]
+sst_standard = ["positive", "negative", "very positive", "very negative", "neutral"]
+# Control_?Map = {j:i for i, j in enumerate(control_lst)}
+
+# Check if Pytorch version >= 1.6 to switch between Native AMP and Apex
+if version.parse(torch.__version__) < version.parse("1.6"):
+ from transformers.file_utils import is_apex_available
+
+ if is_apex_available():
+ from apex import amp
+ _use_apex = True
+else:
+ _use_native_amp = True
+ from torch.cuda.amp import autocast
+
+if is_datasets_available():
+ import datasets
+
+if is_torch_tpu_available():
+ import torch_xla.core.xla_model as xm
+ import torch_xla.debug.metrics as met
+ import torch_xla.distributed.parallel_loader as pl
+
+if is_tensorboard_available():
+ try:
+ from torch.utils.tensorboard import SummaryWriter
+ except ImportError:
+ from tensorboardX import SummaryWriter
+
+if is_wandb_available():
+ import wandb
+
+if is_comet_available():
+ import comet_ml
+
+if is_optuna_available():
+ import optuna
+
+if is_ray_available():
+ from ray import tune
+
+
+logger = logging.get_logger(__name__)
+
+
+@contextmanager
+def torch_distributed_zero_first(local_rank: int):
+ """
+ Decorator to make all processes in distributed training wait for each local_master to do something.
+
+ Args:
+ local_rank (:obj:`int`): The rank of the local process.
+ """
+ if local_rank not in [-1, 0]:
+ torch.distributed.barrier()
+ yield
+ if local_rank == 0:
+ torch.distributed.barrier()
+
+def helper_token2bpe(offsets):
+ full_lst = []
+ for example_offset in offsets:
+ bpe2token = []
+ token2bpe = []
+ token_idx = -1
+ # print(example_offset)
+ for bpe_idx, (a,b) in enumerate(example_offset):
+ # print(token2bpe, a, b, bpe_idx)
+ if b - a > 0:
+ if a == 0:
+ # new token
+ token_idx += 1
+ bpe2token.append(token_idx)
+ token2bpe.append([])
+ token2bpe[-1].append(bpe_idx)
+ else:
+ # prev token.
+ bpe2token.append(token_idx)
+ token2bpe[-1].append(bpe_idx)
+ else:
+ bpe2token.append(None)
+ full_lst.append((bpe2token, token2bpe))
+ return full_lst
+
+class SequentialDistributedSampler(Sampler):
+ """
+ Distributed Sampler that subsamples indices sequentially,
+ making it easier to collate all results at the end.
+
+ Even though we only use this sampler for eval and predict (no training),
+ which means that the model params won't have to be synced (i.e. will not hang
+ for synchronization even if varied number of forward passes), we still add extra
+ samples to the sampler to make it evenly divisible (like in `DistributedSampler`)
+ to make it easy to `gather` or `reduce` resulting tensors at the end of the loop.
+ """
+
+ def __init__(self, dataset, num_replicas=None, rank=None):
+ if num_replicas is None:
+ if not torch.distributed.is_available():
+ raise RuntimeError("Requires distributed package to be available")
+ num_replicas = torch.distributed.get_world_size()
+ if rank is None:
+ if not torch.distributed.is_available():
+ raise RuntimeError("Requires distributed package to be available")
+ rank = torch.distributed.get_rank()
+ self.dataset = dataset
+ self.num_replicas = num_replicas
+ self.rank = rank
+ self.num_samples = int(math.ceil(len(self.dataset) * 1.0 / self.num_replicas))
+ self.total_size = self.num_samples * self.num_replicas
+
+ def __iter__(self):
+ indices = list(range(len(self.dataset)))
+
+ # add extra samples to make it evenly divisible
+ indices += indices[: (self.total_size - len(indices))]
+ assert (
+ len(indices) == self.total_size
+ ), f"Indices length {len(indices)} and total size {self.total_size} mismatched"
+
+ # subsample
+ indices = indices[self.rank * self.num_samples : (self.rank + 1) * self.num_samples]
+ assert (
+ len(indices) == self.num_samples
+ ), f"Indices length {len(indices)} and sample number {self.num_samples} mismatched"
+
+ return iter(indices)
+
+ def __len__(self):
+ return self.num_samples
+
+
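The padding arithmetic above is easiest to see with concrete numbers; the sketch below reproduces it for a hypothetical dataset of 10 examples on 4 replicas (the duplicated tail indices must be dropped after gathering).

```python
import math

dataset_len, num_replicas = 10, 4
num_samples = int(math.ceil(dataset_len * 1.0 / num_replicas))  # 3 per replica
total_size = num_samples * num_replicas                         # 12, evenly divisible

indices = list(range(dataset_len))
indices += indices[: total_size - len(indices)]                 # pad: [0..9, 0, 1]

shards = [indices[r * num_samples:(r + 1) * num_samples] for r in range(num_replicas)]
print(shards)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]]
```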
+def get_tpu_sampler(dataset: Dataset):
+ if xm.xrt_world_size() <= 1:
+ return RandomSampler(dataset)
+ return DistributedSampler(dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal())
+
+
+class Trainer_Prefix:
+ """
+ Trainer is a simple but feature-complete training and eval loop for PyTorch,
+ optimized for 🤗 Transformers.
+
+ Args:
+ model (:class:`~transformers.PreTrainedModel`, `optional`):
+ The model to train, evaluate or use for predictions. If not provided, a ``model_init`` must be passed.
+ args (:class:`~transformers.TrainingArguments`, `optional`):
+ The arguments to tweak for training. Will default to a basic instance of :class:`~transformers.TrainingArguments`
+ with the ``output_dir`` set to a directory named `tmp_trainer` in the current directory if not provided.
+ data_collator (:obj:`DataCollator`, `optional`):
+ The function to use to form a batch from a list of elements of :obj:`train_dataset` or
+ :obj:`eval_dataset`. Will default to :func:`~transformers.default_data_collator` if no ``tokenizer`` is
+ provided, an instance of :func:`~transformers.DataCollatorWithPadding` otherwise.
+ train_dataset (:obj:`torch.utils.data.dataset.Dataset`, `optional`):
+ The dataset to use for training. If it is an :obj:`datasets.Dataset`, columns not accepted by the
+ ``model.forward()`` method are automatically removed.
+ eval_dataset (:obj:`torch.utils.data.dataset.Dataset`, `optional`):
+ The dataset to use for evaluation. If it is an :obj:`datasets.Dataset`, columns not accepted by the
+ ``model.forward()`` method are automatically removed.
+ tokenizer (:class:`PreTrainedTokenizerBase`, `optional`):
+ The tokenizer used to preprocess the data. If provided, will be used to automatically pad the inputs the
+ maximum length when batching inputs, and it will be saved along the model to make it easier to rerun an
+ interrupted training or reuse the fine-tuned model.
+ model_init (:obj:`Callable[[], PreTrainedModel]`, `optional`):
+ A function that instantiates the model to be used. If provided, each call to
+ :meth:`~transformers.Trainer.train` will start from a new instance of the model as given by this function.
+ compute_metrics (:obj:`Callable[[EvalPrediction], Dict]`, `optional`):
+ The function that will be used to compute metrics at evaluation. Must take a
+ :class:`~transformers.EvalPrediction` and return a dictionary string to metric values.
+ tb_writer (:obj:`SummaryWriter`, `optional`):
+ Object to write to TensorBoard.
+ optimizers (:obj:`Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR]`, `optional`):
+ A tuple containing the optimizer and the scheduler to use. Will default to an instance of
+ :class:`~transformers.AdamW` on your model and a scheduler given by
+ :func:`~transformers.get_linear_schedule_with_warmup` controlled by :obj:`args`.
+ kwargs:
+ Deprecated keyword arguments.
+ """
+
+ def __init__(
+ self,
+ model: Optional[PreTrainedModel] = None,
+ model_gpt2 : Optional[PreTrainedModel] = None,
+ args: TrainingArguments = None,
+ data_collator: Optional[DataCollator] = None,
+ train_dataset: Optional[Dataset] = None,
+ eval_dataset: Optional[Dataset] = None,
+ tokenizer: Optional["PreTrainedTokenizerBase"] = None,
+ model_init: Callable[[], PreTrainedModel] = None,
+ compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None,
+ tb_writer: Optional["SummaryWriter"] = None,
+ optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None),
+ task_mode: Optional[str] = None,
+ use_dropout: Optional[bool] = False,
+ distill: Optional[bool] = False,
+ matching_objective:Optional[str]= None,
+ finetuned_gpt2: Optional[PreTrainedModel] = None,
+ **kwargs,
+ ):
+ if args is None:
+ logger.info("No `TrainingArguments` passed, defaulting to `output_dir=tmp_trainer` in the current directory.")
+ args = TrainingArguments("tmp_trainer")
+ self.args = args
+ # Seed must be set before instantiating the model when using model_init.
+ set_seed(self.args.seed)
+ assert (
+ model is not None or model_init is not None
+ ), "You must provide a model to use `Trainer`, either by using the `model` argument or the `model_init` argument."
+ assert model_init is None
+ self.model = model.to(args.device) if model is not None else None
+ self.gpt2 = model_gpt2.to(args.device) if model_gpt2 is not None else None
+ default_collator = default_data_collator if tokenizer is None else DataCollatorWithPadding(tokenizer)
+ self.data_collator = data_collator if data_collator is not None else default_collator
+ self.train_dataset = train_dataset
+ self.eval_dataset = eval_dataset
+ self.tokenizer = tokenizer
+ self.model_init = model_init
+ self.compute_metrics = compute_metrics
+ self.optimizer, self.lr_scheduler = optimizers
+ self.task_mode = task_mode
+ self.use_dropout = use_dropout
+
+ self.curr_best_eval = 10000000.
+
+ self.distill = distill
+ if self.distill:
+ self.matching_objective = matching_objective
+ self.finetuned_gpt2 = finetuned_gpt2
+
+ if model_init is not None and (self.optimizer is not None or self.lr_scheduler is not None):
+ raise RuntimeError(
+ "Passing a `model_init` is incompatible with providing the `optimizers` argument."
+ "You should subclass `Trainer` and override the `create_optimizer_and_scheduler` method."
+ )
+ self.tb_writer = tb_writer
+ self.log_history = []
+ if "prediction_loss_only" in kwargs:
+ warnings.warn(
+ "Passing `prediction_loss_only` as a keyword argument is deprecated and won't be possible in a future version. Use `args.prediction_loss_only` instead.",
+ FutureWarning,
+ )
+ self.args.prediction_loss_only = kwargs.pop("prediction_loss_only")
+ assert kwargs == {}, f"Unexpected keyword arguments: {list(kwargs.keys())}."
+
+ if tb_writer is None and is_tensorboard_available() and self.is_world_process_zero():
+ self.tb_writer = SummaryWriter(log_dir=self.args.logging_dir)
+ if not is_tensorboard_available():
+ logger.warning(
+ "You are instantiating a Trainer but Tensorboard is not installed. You should consider installing it."
+ )
+
+ # Will be set to True by `self._setup_loggers()` on first call to `self.log()`.
+ self._loggers_initialized = False
+
+ # Create output directory if needed
+ if self.is_world_process_zero():
+ os.makedirs(self.args.output_dir, exist_ok=True)
+ if is_torch_tpu_available():
+ # Set an xla_device flag on the model's config.
+ # We'll find a more elegant way and won't need to do this in the future.
+ self.model.config.xla_device = True
+ if not callable(self.data_collator) and callable(getattr(self.data_collator, "collate_batch", None)):
+ self.data_collator = self.data_collator.collate_batch
+ warnings.warn(
+ (
+ "The `data_collator` should now be a simple callable (function, class with `__call__`), classes "
+ + "with a `collate_batch` are deprecated and won't be supported in a future version."
+ ),
+ FutureWarning,
+ )
+
+ if is_datasets_available():
+ if isinstance(train_dataset, datasets.Dataset):
+ self._remove_unused_columns(self.train_dataset, description="training")
+ if isinstance(eval_dataset, datasets.Dataset):
+ self._remove_unused_columns(self.eval_dataset, description="evaluation")
+
+ self.global_step = None
+ self.epoch = None
+ self.total_flos = None
+ if self.args.fp16 and _use_native_amp:
+ self.scaler = torch.cuda.amp.GradScaler()
+ self.hp_search_backend = None
+ self.use_tune_checkpoints = False
+ if self.args.label_names is None:
+ self.args.label_names = ["labels"]
+
+ def _remove_unused_columns(self, dataset: "datasets.Dataset", description: Optional[str] = None):
+ if not self.args.remove_unused_columns:
+ return
+ # Inspect model forward signature to keep only the arguments it accepts.
+ signature = inspect.signature(self.model.forward)
+ signature_columns = list(signature.parameters.keys())
+ # Labels may be named label or label_ids, the default data collator handles that.
+ signature_columns += ["label", "label_ids"]
+ columns = [k for k in signature_columns if k in dataset.column_names]
+ ignored_columns = list(set(dataset.column_names) - set(signature_columns))
+ dset_description = "" if description is None else f"in the {description} set "
+ logger.info(
+ f"The following columns {dset_description}don't have a corresponding argument in `{self.model.__class__.__name__}.forward` and have been ignored: {', '.join(ignored_columns)}."
+ )
+ dataset.set_format(type=dataset.format["type"], columns=columns)
+
+ def _get_train_sampler(self) -> Optional[torch.utils.data.sampler.Sampler]:
+ if isinstance(self.train_dataset, torch.utils.data.IterableDataset):
+ return None
+ elif is_torch_tpu_available():
+ return get_tpu_sampler(self.train_dataset)
+ else:
+ return (
+ RandomSampler(self.train_dataset)
+ if self.args.local_rank == -1
+ else DistributedSampler(self.train_dataset)
+ )
+
+ def get_train_dataloader(self) -> DataLoader:
+ """
+ Returns the training :class:`~torch.utils.data.DataLoader`.
+
+ Will use no sampler if :obj:`self.train_dataset` is a :obj:`torch.utils.data.IterableDataset`, a random sampler
+ (adapted to distributed training if necessary) otherwise.
+
+ Subclass and override this method if you want to inject some custom behavior.
+ """
+ if self.train_dataset is None:
+ raise ValueError("Trainer: training requires a train_dataset.")
+ train_sampler = self._get_train_sampler()
+
+ return DataLoader(
+ self.train_dataset,
+ batch_size=self.args.train_batch_size,
+ sampler=train_sampler,
+ collate_fn=self.data_collator,
+ drop_last=self.args.dataloader_drop_last,
+ num_workers=self.args.dataloader_num_workers,
+ # Note: np.random.seed(...) returns None, so this seeds NumPy once at DataLoader
+ # construction time rather than registering a per-worker init function.
+ worker_init_fn=np.random.seed(self.args.seed)
+ )
+
+ def _get_eval_sampler(self, eval_dataset: Dataset) -> Optional[torch.utils.data.sampler.Sampler]:
+ if isinstance(eval_dataset, torch.utils.data.IterableDataset):
+ return None
+ elif is_torch_tpu_available():
+ return SequentialDistributedSampler(eval_dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal())
+ elif self.args.local_rank != -1:
+ return SequentialDistributedSampler(eval_dataset)
+ else:
+ return SequentialSampler(eval_dataset)
+
+ def get_eval_dataloader(self, eval_dataset: Optional[Dataset] = None) -> DataLoader:
+ """
+ Returns the evaluation :class:`~torch.utils.data.DataLoader`.
+
+ Will use no sampler if :obj:`self.eval_dataset` is a :obj:`torch.utils.data.IterableDataset`, a sequential
+ sampler (adapted to distributed training if necessary) otherwise.
+
+ Subclass and override this method if you want to inject some custom behavior.
+
+ Args:
+ eval_dataset (:obj:`torch.utils.data.dataset.Dataset`, `optional`):
+ If provided, will override :obj:`self.eval_dataset`. If it is an :obj:`datasets.Dataset`, columns not
+ accepted by the ``model.forward()`` method are automatically removed.
+ """
+ if eval_dataset is None and self.eval_dataset is None:
+ raise ValueError("Trainer: evaluation requires an eval_dataset.")
+ elif eval_dataset is not None and is_datasets_available() and isinstance(eval_dataset, datasets.Dataset):
+ self._remove_unused_columns(eval_dataset, description="evaluation")
+ eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset
+ eval_sampler = self._get_eval_sampler(eval_dataset)
+
+ return DataLoader(
+ eval_dataset,
+ sampler=eval_sampler,
+ batch_size=self.args.eval_batch_size,
+ collate_fn=self.data_collator,
+ drop_last=self.args.dataloader_drop_last,
+ num_workers=self.args.dataloader_num_workers,
+ worker_init_fn=np.random.seed(self.args.seed)
+ )
+
+ def get_test_dataloader(self, test_dataset: Dataset) -> DataLoader:
+ """
+ Returns the test :class:`~torch.utils.data.DataLoader`.
+
+ Will use no sampler if :obj:`test_dataset` is a :obj:`torch.utils.data.IterableDataset`, a sequential
+ sampler (adapted to distributed training if necessary) otherwise.
+
+ Subclass and override this method if you want to inject some custom behavior.
+
+ Args:
+ eval_dataset (:obj:`torch.utils.data.dataset.Dataset`, `optional`):
+ The test dataset to use. If it is an :obj:`datasets.Dataset`, columns not accepted by the
+ ``model.forward()`` method are automatically removed.
+ """
+ if is_datasets_available() and isinstance(test_dataset, datasets.Dataset):
+ self._remove_unused_columns(test_dataset, description="test")
+ test_sampler = self._get_eval_sampler(test_dataset)
+
+ # We use the same batch_size as for eval.
+ return DataLoader(
+ test_dataset,
+ sampler=test_sampler,
+ batch_size=self.args.eval_batch_size,
+ collate_fn=self.data_collator,
+ drop_last=self.args.dataloader_drop_last,
+ worker_init_fn=np.random.seed(self.args.seed)
+ )
+
+ def create_optimizer_and_scheduler(self, num_training_steps: int):
+ """
+ Setup the optimizer and the learning rate scheduler.
+
+ We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the
+ Trainer's init through :obj:`optimizers`, or subclass and override this method in a subclass.
+ """
+ if self.optimizer is None:
+ no_decay = ["bias", "LayerNorm.weight"]
+ optimizer_grouped_parameters = [
+ {
+ "params": [p for n, p in self.model.named_parameters() if (not any(nd in n for nd in no_decay)) and p.requires_grad],
+ "weight_decay": self.args.weight_decay,
+ },
+ {
+ "params": [p for n, p in self.model.named_parameters() if any(nd in n for nd in no_decay) and p.requires_grad],
+ "weight_decay": 0.0,
+ },
+ ]
+
+ self.optimizer = AdamW(
+ optimizer_grouped_parameters,
+ lr=self.args.learning_rate,
+ betas=(self.args.adam_beta1, self.args.adam_beta2),
+ eps=self.args.adam_epsilon,
+ )
+
+
+ # for n, p in self.model.named_parameters():
+ # print(n,p.requires_grad)
+ print(self.optimizer.state_dict())
+ if self.lr_scheduler is None:
+ self.lr_scheduler = get_linear_schedule_with_warmup(
+ self.optimizer, num_warmup_steps=self.args.warmup_steps, num_training_steps=num_training_steps
+ )
+
+
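The weight-decay split in `create_optimizer_and_scheduler` is a standard Transformer recipe: biases and LayerNorm weights are excluded from decay. Below is a self-contained sketch on a toy module; the module name and hyperparameter values are illustrative, not taken from this repository.

```python
import torch
from torch import nn

class TinyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(8, 8)
        self.LayerNorm = nn.LayerNorm(8)

model = TinyBlock()
no_decay = ["bias", "LayerNorm.weight"]

grouped = [
    {"params": [p for n, p in model.named_parameters()
                if not any(nd in n for nd in no_decay) and p.requires_grad],
     "weight_decay": 0.01},   # only dense.weight lands here
    {"params": [p for n, p in model.named_parameters()
                if any(nd in n for nd in no_decay) and p.requires_grad],
     "weight_decay": 0.0},    # dense.bias, LayerNorm.weight, LayerNorm.bias
]
optimizer = torch.optim.AdamW(grouped, lr=5e-5)
```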
+ def setup_wandb(self):
+ """
+ Setup the optional Weights & Biases (`wandb`) integration.
+
+ One can subclass and override this method to customize the setup if needed. Find more information
+ `here <https://docs.wandb.com/huggingface>`__. You can also override the following environment variables:
+
+ Environment:
+ WANDB_WATCH:
+ (Optional, ["gradients", "all", "false"]) "gradients" by default, set to "false" to disable gradient logging
+ or "all" to log gradients and parameters
+ WANDB_PROJECT:
+ (Optional): str - "huggingface" by default, set this to a custom string to store results in a different project
+ WANDB_DISABLED:
+ (Optional): boolean - defaults to false, set to "true" to disable wandb entirely
+ """
+ if hasattr(self, "_setup_wandb"):
+ warnings.warn(
+ "The `_setup_wandb` method is deprecated and won't be called in a future version, define `setup_wandb` in your subclass.",
+ FutureWarning,
+ )
+ return self._setup_wandb()
+
+ if self.is_world_process_zero():
+ logger.info(
+ 'Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"'
+ )
+ try:
+ combined_dict = {**self.model.config.to_dict(), **self.args.to_sanitized_dict()}
+ except AttributeError:
+ # in case the model has no config
+ combined_dict = {**self.args.to_sanitized_dict()}
+ wandb.init(
+ project=os.getenv("WANDB_PROJECT", "huggingface"), config=combined_dict, name=self.args.run_name
+ )
+ # keep track of model topology and gradients, unsupported on TPU
+ if not is_torch_tpu_available() and os.getenv("WANDB_WATCH") != "false":
+ wandb.watch(
+ self.model, log=os.getenv("WANDB_WATCH", "gradients"), log_freq=max(100, self.args.logging_steps)
+ )
+
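A quick sketch of steering this integration from code via the environment variables documented above; the project name is illustrative.

```python
import os

os.environ["WANDB_PROJECT"] = "prefix-tuning-runs"  # hypothetical project name
os.environ["WANDB_WATCH"] = "false"                 # skip gradient/parameter logging
# os.environ["WANDB_DISABLED"] = "true"             # uncomment to disable wandb entirely
```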
+ def setup_comet(self):
+ """
+ Setup the optional Comet.ml integration.
+
+ Environment:
+ COMET_MODE:
+ (Optional): str - "OFFLINE", "ONLINE", or "DISABLED"
+ COMET_PROJECT_NAME:
+ (Optional): str - Comet.ml project name for experiments
+ COMET_OFFLINE_DIRECTORY:
+ (Optional): str - folder to use for saving offline experiments when `COMET_MODE` is "OFFLINE"
+
+ For a number of configurable items in the environment,
+ see `here <https://www.comet.ml/docs/python-sdk/huggingface/>`__
+ """
+ if self.is_world_master():
+ comet_mode = os.getenv("COMET_MODE", "ONLINE").upper()
+ args = {"project_name": os.getenv("COMET_PROJECT_NAME", "huggingface")}
+ experiment = None
+ if comet_mode == "ONLINE":
+ experiment = comet_ml.Experiment(**args)
+ logger.info("Automatic Comet.ml online logging enabled")
+ elif comet_mode == "OFFLINE":
+ args["offline_directory"] = os.getenv("COMET_OFFLINE_DIRECTORY", "./")
+ experiment = comet_ml.OfflineExperiment(**args)
+ logger.info("Automatic Comet.ml offline logging enabled; use `comet upload` when finished")
+ if experiment is not None:
+ experiment._set_model_graph(self.model, framework="transformers")
+ experiment._log_parameters(self.args, prefix="args/", framework="transformers")
+ experiment._log_parameters(self.model.config, prefix="config/", framework="transformers")
+
+ def num_examples(self, dataloader: DataLoader) -> int:
+ """
+ Helper to get number of samples in a :class:`~torch.utils.data.DataLoader` by accessing its dataset.
+ """
+ return len(dataloader.dataset)
+
+ def _setup_loggers(self):
+ if self._loggers_initialized:
+ return
+ if is_wandb_available():
+ self.setup_wandb()
+ elif os.environ.get("WANDB_DISABLED") != "true":
+ logger.info(
+ "You are instantiating a Trainer but W&B is not installed. To use wandb logging, "
+ "run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface."
+ )
+ if is_comet_available():
+ self.setup_comet()
+ elif os.environ.get("COMET_MODE") != "DISABLED":
+ logger.info(
+ "To use comet_ml logging, run `pip/conda install comet_ml` "
+ "see https://www.comet.ml/docs/python-sdk/huggingface/"
+ )
+ self._loggers_initialized = True
+
+ def _hp_search_setup(self, trial: Union["optuna.Trial", Dict[str, Any]]):
+ """ HP search setup code """
+ if self.hp_search_backend is None or trial is None:
+ return
+ params = self.hp_space(trial) if self.hp_search_backend == HPSearchBackend.OPTUNA else trial
+ for key, value in params.items():
+ if not hasattr(self.args, key):
+ raise AttributeError(
+ f"Trying to set {key} in the hyperparameter search but there is no corresponding field in `TrainingArguments`."
+ )
+ old_attr = getattr(self.args, key, None)
+ # Casting value to the proper type
+ if old_attr is not None:
+ value = type(old_attr)(value)
+ setattr(self.args, key, value)
+ if self.hp_search_backend == HPSearchBackend.OPTUNA:
+ logger.info("Trial: %s", trial.params)
+
+ def _report_to_hp_search(
+ self, trial: Union["optuna.Trial", Dict[str, Any]], epoch: int, metrics: Dict[str, float]
+ ):
+ if self.hp_search_backend is None or trial is None:
+ return
+ self.objective = self.compute_objective(metrics)
+ if self.hp_search_backend == HPSearchBackend.OPTUNA:
+ trial.report(self.objective, epoch)
+ if trial.should_prune():
+ raise optuna.TrialPruned()
+ elif self.hp_search_backend == HPSearchBackend.RAY:
+ if self.global_step % self.args.save_steps == 0:
+ self._tune_save_checkpoint()
+ tune.report(objective=self.objective, **metrics)
+
+ def _tune_save_checkpoint(self):
+ if not self.use_tune_checkpoints:
+ return
+ with tune.checkpoint_dir(step=self.global_step) as checkpoint_dir:
+ self.args.output_dir = checkpoint_dir
+ output_dir = os.path.join(self.args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{self.global_step}")
+ self.save_model(output_dir)
+ if self.is_world_master():
+ torch.save(self.optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
+ torch.save(self.lr_scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
+
+
+ def train(self, model_path: Optional[str] = None, trial: Union["optuna.Trial", Dict[str, Any]] = None):
+ """
+ Main training entry point.
+
+ Args:
+ model_path (:obj:`str`, `optional`):
+ Local path to the model if the model to train has been instantiated from a local path. If present,
+ training will resume from the optimizer/scheduler states loaded here.
+ trial (:obj:`optuna.Trial` or :obj:`Dict[str, Any]`, `optional`):
+ The trial run or the hyperparameter dictionary for hyperparameter search.
+ """
+ # This might change the seed so needs to run first.
+ self._hp_search_setup(trial)
+
+ # Model re-init
+ if self.model_init is not None:
+ # Seed must be set before instantiating the model when using model_init.
+ set_seed(self.args.seed)
+ model = self.model_init()
+ self.model = model.to(self.args.device)
+
+ # Reinitializes optimizer and scheduler
+ self.optimizer, self.lr_scheduler = None, None
+
+ # Data loader and number of training steps
+ train_dataloader = self.get_train_dataloader()
+ num_update_steps_per_epoch = len(train_dataloader) // self.args.gradient_accumulation_steps
+ num_update_steps_per_epoch = max(num_update_steps_per_epoch, 1)
+ if self.args.max_steps > 0:
+ t_total = self.args.max_steps
+ num_train_epochs = self.args.max_steps // num_update_steps_per_epoch + int(
+ self.args.max_steps % num_update_steps_per_epoch > 0
+ )
+ else:
+ t_total = int(num_update_steps_per_epoch * self.args.num_train_epochs)
+ num_train_epochs = self.args.num_train_epochs
+ self.args.max_steps = t_total
+
+ self.create_optimizer_and_scheduler(num_training_steps=t_total)
+
+ # Check if saved optimizer or scheduler states exist
+ if (
+ model_path is not None
+ and os.path.isfile(os.path.join(model_path, "optimizer.pt"))
+ and os.path.isfile(os.path.join(model_path, "scheduler.pt"))
+ ):
+ # Load in optimizer and scheduler states
+ self.optimizer.load_state_dict(
+ torch.load(os.path.join(model_path, "optimizer.pt"), map_location=self.args.device)
+ )
+ self.lr_scheduler.load_state_dict(torch.load(os.path.join(model_path, "scheduler.pt")))
+
+ model = self.model
+ if self.args.fp16 and _use_apex:
+ if not is_apex_available():
+ raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
+ model, self.optimizer = amp.initialize(model, self.optimizer, opt_level=self.args.fp16_opt_level)
+
+ # multi-gpu training (should be after apex fp16 initialization)
+ if self.args.n_gpu > 1:
+ model = torch.nn.DataParallel(model)
+
+ # Distributed training (should be after apex fp16 initialization)
+ if self.args.local_rank != -1:
+ model = torch.nn.parallel.DistributedDataParallel(
+ model,
+ device_ids=[self.args.local_rank],
+ output_device=self.args.local_rank,
+ find_unused_parameters=True,
+ )
+
+ if self.tb_writer is not None:
+ self.tb_writer.add_text("args", self.args.to_json_string())
+ self.tb_writer.add_hparams(self.args.to_sanitized_dict(), metric_dict={})
+
+ # Train!
+ if is_torch_tpu_available():
+ total_train_batch_size = self.args.train_batch_size * xm.xrt_world_size()
+ else:
+ total_train_batch_size = (
+ self.args.train_batch_size
+ * self.args.gradient_accumulation_steps
+ * (torch.distributed.get_world_size() if self.args.local_rank != -1 else 1)
+ )
+ logger.info("***** Running training *****")
+ logger.info(" Num examples = %d", self.num_examples(train_dataloader))
+ logger.info(" Num Epochs = %d", num_train_epochs)
+ logger.info(" Instantaneous batch size per device = %d", self.args.per_device_train_batch_size)
+ logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d", total_train_batch_size)
+ logger.info(" Gradient Accumulation steps = %d", self.args.gradient_accumulation_steps)
+ logger.info(" Total optimization steps = %d", t_total)
+
+ self.global_step = 0
+ self.epoch = 0
+ self.total_flos = 0
+ epochs_trained = 0
+ steps_trained_in_current_epoch = 0
+ # Check if continuing training from a checkpoint
+ if model_path is not None:
+ # set global_step to global_step of last saved checkpoint from model path
+ try:
+ self.global_step = int(model_path.split("-")[-1].split(os.path.sep)[0])
+ # print(model, model.module)
+ if self.args.n_gpu > 1:
+ self.total_flos = getattr(model.module.config, "total_flos", 0)
+ else:
+ self.total_flos = getattr(model.config, "total_flos", 0)
+
+ epochs_trained = self.global_step // num_update_steps_per_epoch
+ steps_trained_in_current_epoch = self.global_step % (num_update_steps_per_epoch)
+
+ logger.info(" Continuing training from checkpoint, will skip to saved global_step")
+ logger.info(" Continuing training from epoch %d", epochs_trained)
+ logger.info(" Continuing training from global step %d", self.global_step)
+ logger.info(" Continuing training from %d non-embedding floating-point operations", self.total_flos)
+ logger.info(" Will skip the first %d steps in the first epoch", steps_trained_in_current_epoch)
+ except ValueError:
+ self.global_step = 0
+ self.total_flos = 0
+ logger.info(" Starting fine-tuning.")
+
+ tr_loss = torch.tensor(0.0).to(self.args.device)
+ logging_loss_scalar = 0.0
+ model.zero_grad()
+ disable_tqdm = self.args.disable_tqdm or not self.is_local_process_zero()
+ train_pbar = trange(epochs_trained, int(np.ceil(num_train_epochs)), desc="Epoch", disable=disable_tqdm)
+ for epoch in range(epochs_trained, int(np.ceil(num_train_epochs))):
+ if isinstance(train_dataloader, DataLoader) and isinstance(train_dataloader.sampler, DistributedSampler):
+ train_dataloader.sampler.set_epoch(epoch)
+
+ if is_torch_tpu_available():
+ parallel_loader = pl.ParallelLoader(train_dataloader, [self.args.device]).per_device_loader(
+ self.args.device
+ )
+ epoch_iterator = parallel_loader
+ else:
+ epoch_iterator = train_dataloader
+
+ # Reset the past mems state at the beginning of each epoch if necessary.
+ if self.args.past_index >= 0:
+ self._past = None
+
+ epoch_pbar = tqdm(epoch_iterator, desc="Iteration", disable=disable_tqdm)
+ for step, inputs in enumerate(epoch_iterator):
+
+ # Skip past any already trained steps if resuming training
+ if steps_trained_in_current_epoch > 0:
+ steps_trained_in_current_epoch -= 1
+ epoch_pbar.update(1)
+ continue
+
+ tr_loss += self.training_step(model, inputs)
+
+ self.total_flos += self.floating_point_ops(inputs)
+
+ if (step + 1) % self.args.gradient_accumulation_steps == 0 or (
+ # last step in epoch but step is always smaller than gradient_accumulation_steps
+ len(epoch_iterator) <= self.args.gradient_accumulation_steps
+ and (step + 1) == len(epoch_iterator)
+ ):
+ if self.args.fp16 and _use_native_amp:
+ self.scaler.unscale_(self.optimizer)
+ torch.nn.utils.clip_grad_norm_(model.parameters(), self.args.max_grad_norm)
+ elif self.args.fp16 and _use_apex:
+ torch.nn.utils.clip_grad_norm_(amp.master_params(self.optimizer), self.args.max_grad_norm)
+ else:
+ torch.nn.utils.clip_grad_norm_(model.parameters(), self.args.max_grad_norm)
+
+ if is_torch_tpu_available():
+ xm.optimizer_step(self.optimizer)
+ elif self.args.fp16 and _use_native_amp:
+ self.scaler.step(self.optimizer)
+ self.scaler.update()
+ else:
+ self.optimizer.step()
+
+ # URGENT
+ self.lr_scheduler.step()
+ model.zero_grad()
+ self.global_step += 1
+ self.epoch = epoch + (step + 1) / len(epoch_iterator)
+
+
+ if (self.args.logging_steps > 0 and self.global_step % self.args.logging_steps == 0) or (
+ self.global_step == 1 and self.args.logging_first_step
+ ):
+ logs: Dict[str, float] = {}
+ tr_loss_scalar = tr_loss.item()
+ logs["loss"] = (tr_loss_scalar - logging_loss_scalar) / self.args.logging_steps
+ # backward compatibility for pytorch schedulers
+ logs["learning_rate"] = (
+ self.lr_scheduler.get_last_lr()[0]
+ if version.parse(torch.__version__) >= version.parse("1.4")
+ else self.lr_scheduler.get_lr()[0]
+ )
+ logging_loss_scalar = tr_loss_scalar
+
+ self.log(logs)
+
+ # print(self.args.evaluation_strategy == EvaluationStrategy.STEPS )
+ # print(self.global_step % self.args.eval_steps == 0)
+ # print()
+
+ if (
+ self.args.evaluation_strategy == EvaluationStrategy.STEPS
+ and self.global_step % self.args.eval_steps == 0
+ ):
+ metrics = self.evaluate()
+ self._report_to_hp_search(trial, epoch, metrics)
+
+ #############################EARLY STOPPING########################
+ if 'lowdata' in self.args.output_dir or 'earlystop' in self.args.output_dir:
+ self.save_based_on_eval = True
+ else:
+ self.save_based_on_eval = False
+ print('If no "lowdata:" line appears below, eval-based checkpoint saving was not triggered.')
+ if self.save_based_on_eval and metrics["eval_loss"] < self.curr_best_eval:
+ print('lowdata:', self.global_step, self.curr_best_eval, metrics["eval_loss"],
+ 'perplexity={}'.format(math.exp(metrics["eval_loss"])))
+ self.curr_best_eval = metrics["eval_loss"]
+ if hasattr(model, "module"):
+ assert (
+ model.module is self.model
+ ), f"Module {model.module} should be a reference to self.model"
+ else:
+ assert model is self.model, f"Model {model} should be a reference to self.model"
+ # Save model checkpoint
+ output_dir_name = os.path.basename(self.args.output_dir)
+ checkpoint_folder = f"{output_dir_name}-{PREFIX_CHECKPOINT_DIR}-{self.global_step}"
+ if self.hp_search_backend is not None and trial is not None:
+ run_id = (
+ trial.number
+ if self.hp_search_backend == HPSearchBackend.OPTUNA
+ else tune.get_trial_id()
+ )
+ checkpoint_folder += f"-run-{run_id}"
+ output_dir = os.path.join(self.args.output_dir, checkpoint_folder)
+
+ self.store_flos()
+ print('saving to output_dir', output_dir)
+ self.save_model(output_dir)
+
+ if self.is_world_process_zero():
+ self._rotate_checkpoints(use_mtime=True)
+ #####################################################
+
+ if self.args.save_steps > 0 and self.global_step % self.args.save_steps == 0:
+ print('saving model at a checkpoint!!')
+ # In all cases (even distributed/parallel), self.model is always a reference
+ # to the model we want to save.
+ if hasattr(model, "module"):
+ assert (
+ model.module is self.model
+ ), f"Module {model.module} should be a reference to self.model"
+ else:
+ assert model is self.model, f"Model {model} should be a reference to self.model"
+ # Save model checkpoint
+ checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.global_step}"
+ if self.hp_search_backend is not None and trial is not None:
+ run_id = (
+ trial.number
+ if self.hp_search_backend == HPSearchBackend.OPTUNA
+ else tune.get_trial_id()
+ )
+ checkpoint_folder += f"-run-{run_id}"
+ output_dir = os.path.join(self.args.output_dir, checkpoint_folder)
+
+ self.store_flos()
+
+ self.save_model(output_dir)
+
+ if self.is_world_process_zero():
+ self._rotate_checkpoints(use_mtime=True)
+
+ if is_torch_tpu_available():
+ xm.rendezvous("saving_optimizer_states")
+ xm.save(self.optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
+ xm.save(self.lr_scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
+ elif self.is_world_process_zero():
+ torch.save(self.optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
+ torch.save(self.lr_scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
+
+ epoch_pbar.update(1)
+ if self.args.max_steps > 0 and self.global_step >= self.args.max_steps:
+ break
+ epoch_pbar.close()
+ train_pbar.update(1)
+
+ if self.args.evaluation_strategy == EvaluationStrategy.EPOCH:
+ metrics = self.evaluate()
+ self._report_to_hp_search(trial, epoch, metrics)
+
+ if self.args.tpu_metrics_debug or self.args.debug:
+ if is_torch_tpu_available():
+ # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)
+ xm.master_print(met.metrics_report())
+ else:
+ logger.warning(
+ "You enabled PyTorch/XLA debug metrics but you don't have a TPU "
+ "configured. Check your training configuration if this is unexpected."
+ )
+ if self.args.max_steps > 0 and self.global_step >= self.args.max_steps:
+ break
+
+ train_pbar.close()
+ if self.tb_writer:
+ self.tb_writer.close()
+ if self.args.past_index and hasattr(self, "_past"):
+ # Clean the state at the end of training
+ delattr(self, "_past")
+
+ logger.info("\n\nTraining completed. Do not forget to share your model on huggingface.co/models =)\n\n")
+ return TrainOutput(self.global_step, tr_loss.item() / self.global_step)
+
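To make the bookkeeping in `train()` concrete, here is the step arithmetic with assumed values (2 GPUs under DDP, per-device batch 4, gradient accumulation 8, 3 epochs over 9,600 examples); none of these numbers come from the repository.

```python
per_device_batch = 4
grad_accum = 8
world_size = 2
num_examples = 9600
num_epochs = 3

total_train_batch_size = per_device_batch * grad_accum * world_size            # 64 examples per update
batches_per_rank_per_epoch = num_examples // (per_device_batch * world_size)   # 1200 forward passes
num_update_steps_per_epoch = max(batches_per_rank_per_epoch // grad_accum, 1)  # 150 optimizer steps
t_total = num_update_steps_per_epoch * num_epochs                              # 450 total optimization steps
print(total_train_batch_size, num_update_steps_per_epoch, t_total)
```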
+ def hyperparameter_search(
+ self,
+ hp_space: Optional[Callable[["optuna.Trial"], Dict[str, float]]] = None,
+ compute_objective: Optional[Callable[[Dict[str, float]], float]] = None,
+ n_trials: int = 20,
+ direction: str = "minimize",
+ backend: Optional[Union["str", HPSearchBackend]] = None,
+ **kwargs
+ ) -> BestRun:
+ """
+ Launch a hyperparameter search using ``optuna`` or ``Ray Tune``. The optimized quantity is determined by
+ :obj:`compute_objective`, which defaults to a function returning the evaluation loss when no metric is provided,
+ the sum of all metrics otherwise.
+
+ .. warning::
+
+ To use this method, you need to have provided a ``model_init`` when initializing your
+ :class:`~transformers.Trainer`: we need to reinitialize the model at each new run. This is incompatible
+ with the ``optimizers`` argument, so you need to subclass :class:`~transformers.Trainer` and override the
+ method :meth:`~transformers.Trainer.create_optimizer_and_scheduler` for custom optimizer/scheduler.
+
+ Args:
+ hp_space (:obj:`Callable[["optuna.Trial"], Dict[str, float]]`, `optional`):
+ A function that defines the hyperparameter search space. Will default to
+ :func:`~transformers.trainer_utils.default_hp_space_optuna` or
+ :func:`~transformers.trainer_utils.default_hp_space_ray` depending on your backend.
+ compute_objective (:obj:`Callable[[Dict[str, float]], float]`, `optional`):
+ A function computing the objective to minimize or maximize from the metrics returned by the
+ :obj:`evaluate` method. Will default to :func:`~transformers.trainer_utils.default_compute_objective`.
+ n_trials (:obj:`int`, `optional`, defaults to 20):
+ The number of trial runs to test.
+ direction(:obj:`str`, `optional`, defaults to :obj:`"minimize"`):
+ Whether to minimize or maximize the objective. Can be :obj:`"minimize"` or :obj:`"maximize"`; you should
+ pick :obj:`"minimize"` when optimizing the validation loss, :obj:`"maximize"` when optimizing one or
+ several metrics.
+ backend(:obj:`str` or :class:`~transformers.training_utils.HPSearchBackend`, `optional`):
+ The backend to use for hyperparameter search. Will default to optuna or Ray Tune, depending on which
+ one is installed. If both are installed, will default to optuna.
+ kwargs:
+ Additional keyword arguments passed along to :obj:`optuna.create_study` or :obj:`ray.tune.run`. For
+ more information see:
+
+ - the documentation of :obj:`optuna.create_study`
+ - the documentation of :obj:`ray.tune.run`
+
+ Returns:
+ :class:`transformers.trainer_utils.BestRun`: All the information about the best run.
+ """
+ if backend is None:
+ backend = default_hp_search_backend()
+ if backend is None:
+ raise RuntimeError(
+ "At least one of optuna or ray should be installed. "
+ "To install optuna run `pip install optuna`."
+ "To install ray run `pip install ray[tune]`."
+ )
+ backend = HPSearchBackend(backend)
+ if backend == HPSearchBackend.OPTUNA and not is_optuna_available():
+ raise RuntimeError("You picked the optuna backend, but it is not installed. Use `pip install optuna`.")
+ if backend == HPSearchBackend.RAY and not is_ray_available():
+ raise RuntimeError(
+ "You picked the Ray Tune backend, but it is not installed. Use `pip install 'ray[tune]'`."
+ )
+ self.hp_search_backend = backend
+
+ if self.model_init is None:
+ raise RuntimeError(
+ "To use hyperparameter search, you need to pass your model through a model_init function."
+ )
+
+ self.hp_space = default_hp_space[backend] if hp_space is None else hp_space
+ self.compute_objective = default_compute_objective if compute_objective is None else compute_objective
+
+ run_hp_search = run_hp_search_optuna if backend == HPSearchBackend.OPTUNA else run_hp_search_ray
+ best_run = run_hp_search(self, n_trials, direction, **kwargs)
+
+ self.hp_search_backend = None
+ return best_run
+
+ def log(self, logs: Dict[str, float], iterator: Optional[tqdm] = None) -> None:
+ """
+ Log :obj:`logs` on the various objects watching training.
+
+ Subclass and override this method to inject custom behavior.
+
+ Args:
+ logs (:obj:`Dict[str, float]`):
+ The values to log.
+ iterator (:obj:`tqdm`, `optional`):
+ A potential tqdm progress bar to write the logs on.
+ """
+ # Set up loggers like W&B or Comet ML
+ self._setup_loggers()
+
+ if hasattr(self, "_log"):
+ warnings.warn(
+ "The `_log` method is deprecated and won't be called in a future version, define `log` in your subclass.",
+ FutureWarning,
+ )
+ return self._log(logs, iterator=iterator)
+
+ if self.epoch is not None:
+ logs["epoch"] = self.epoch
+ if self.total_flos is not None:
+ if self.args.local_rank != -1:
+ total_flos = distributed_broadcast_scalars([self.total_flos]).sum().item()
+ else:
+ total_flos = self.total_flos
+ if total_flos > 0:
+ logs["total_flos"] = self.total_flos
+ if self.global_step is None:
+ # when logging evaluation metrics without training
+ self.global_step = 0
+ if self.tb_writer:
+ for k, v in logs.items():
+ if isinstance(v, (int, float)):
+ self.tb_writer.add_scalar(k, v, self.global_step)
+ else:
+ logger.warning(
+ "Trainer is attempting to log a value of "
+ '"%s" of type %s for key "%s" as a scalar. '
+ "This invocation of Tensorboard's writer.add_scalar() "
+ "is incorrect so we dropped this attribute.",
+ v,
+ type(v),
+ k,
+ )
+ self.tb_writer.flush()
+ if is_wandb_available():
+ if self.is_world_process_zero():
+ wandb.log(logs, step=self.global_step)
+ if is_comet_available():
+ if self.is_world_process_zero():
+ experiment = comet_ml.config.get_global_experiment()
+ if experiment is not None:
+ experiment._log_metrics(logs, step=self.global_step, epoch=self.epoch, framework="transformers")
+ output = {**logs, **{"step": self.global_step}}
+ if self.is_world_process_zero():
+ self.log_history.append(output)
+ if iterator is not None:
+ iterator.write(output)
+ else:
+ print(output)
+
+ def _prepare_inputs(self, inputs: Dict[str, Union[torch.Tensor, Any]]) -> Dict[str, Union[torch.Tensor, Any]]:
+ """
+ Prepare :obj:`inputs` before feeding them to the model, converting them to tensors if they are not already and
+ handling potential state.
+ """
+ for k, v in inputs.items():
+ if isinstance(v, torch.Tensor):
+ inputs[k] = v.to(self.args.device)
+
+ if self.args.past_index >= 0 and self._past is not None:
+ # Cached `mems` are not supported by this trainer; fail loudly if this path is ever reached.
+ assert False
+ inputs["mems"] = self._past
+
+ return inputs
+
+ def training_step(self, model: nn.Module, inputs: Dict[str, Union[torch.Tensor, Any]]) -> torch.Tensor:
+ """
+ Perform a training step on a batch of inputs.
+
+ Subclass and override to inject custom behavior.
+
+ Args:
+ model (:obj:`nn.Module`):
+ The model to train.
+ inputs (:obj:`Dict[str, Union[torch.Tensor, Any]]`):
+ The inputs and targets of the model.
+
+ The dictionary will be unpacked before being fed to the model. Most models expect the targets under the
+ argument :obj:`labels`. Check your model's documentation for all accepted arguments.
+
+ Return:
+ :obj:`torch.Tensor`: The tensor with training loss on this batch.
+ """
+ if hasattr(self, "_training_step"):
+ warnings.warn(
+ "The `_training_step` method is deprecated and won't be called in a future version, define `training_step` in your subclass.",
+ FutureWarning,
+ )
+ return self._training_step(model, inputs, self.optimizer)
+
+ model.train()
+ if self.use_dropout:
+ if self.gpt2 is not None:
+ self.gpt2.train()
+ inputs = self._prepare_inputs(inputs)
+
+ if self.args.fp16 and _use_native_amp:
+ with autocast():
+ if self.distill:
+ loss = self.compute_loss_distill(model, inputs, gpt2_model=self.gpt2, )
+ else:
+ loss = self.compute_loss(model, inputs, gpt2_model=self.gpt2)
+ else:
+ if self.distill:
+ loss = self.compute_loss_distill(model, inputs, gpt2_model=self.gpt2)
+ else:
+ loss = self.compute_loss(model, inputs, gpt2_model=self.gpt2)
+
+ if self.args.n_gpu > 1:
+ loss = loss.mean() # mean() to average on multi-gpu parallel training
+
+ if self.args.gradient_accumulation_steps > 1:
+ loss = loss / self.args.gradient_accumulation_steps
+
+ if self.args.fp16 and _use_native_amp:
+ self.scaler.scale(loss).backward()
+ elif self.args.fp16 and _use_apex:
+ with amp.scale_loss(loss, self.optimizer) as scaled_loss:
+ scaled_loss.backward()
+ else:
+ # print(loss)
+ loss.backward()
+
+ # print('max allocated_memory:', torch.cuda.max_memory_allocated(0), 'total_memory:', torch.cuda.get_device_properties(0).total_memory,
+ # 'percentage', torch.cuda.max_memory_allocated(0)/torch.cuda.get_device_properties(0).total_memory)
+
+
+ return loss.detach()
+
+
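The interaction between `autocast`, `GradScaler`, unscaling, and gradient clipping in `training_step`/`train()` above is easier to see in isolation. The following is a condensed sketch with a toy model (requires a CUDA device); the real loop additionally handles gradient accumulation, Apex, and TPUs.

```python
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

model = nn.Linear(16, 4).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = GradScaler()

x = torch.randn(8, 16, device="cuda")
y = torch.randint(0, 4, (8,), device="cuda")

with autocast():                       # mixed-precision forward pass
    loss = nn.functional.cross_entropy(model(x), y)

scaler.scale(loss).backward()          # scaled backward pass
scaler.unscale_(optimizer)             # unscale so clipping sees the true gradient norms
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(optimizer)                 # skips the step if inf/nan gradients were found
scaler.update()
model.zero_grad()
```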
+
+
+
+ def compute_loss(self, model, inputs, gpt2_model=None):
+ """
+ How the loss is computed by Trainer. By default, all models return the loss in the first element.
+
+ Subclass and override for custom behavior.
+ """
+ # outputs = model.forward_weighted(**inputs)
+ if 'prompt_lab' in inputs:
+ prompt_lab_ = inputs['prompt_lab']
+ k = torch.cat(self.discri_labels_code, dim=0)
+ inputs['control_code'] = torch.index_select(k, 0, prompt_lab_)
+ del inputs['prompt_lab']
+
+ outputs = model(**inputs, gpt2_model=gpt2_model)
+ # Save past state if it exists
+ if self.args.past_index >= 0:
+ self._past = outputs[self.args.past_index]
+
+ # print(outputs[0])
+ # We don't use .loss here since the model may return tuples instead of ModelOutput.
+ # print(outputs[0], outputs.loss)
+ # URGENT
+ # print('compute_loss', outputs[0])
+ return outputs[0].mean()
+
+ def compute_loss_distill(self, model, inputs, gpt2_model=None):
+ """
+ How the loss is computed by Trainer. By default, all models return the loss in the first element.
+
+ Subclass and override for custom behavior.
+ """
+ # outputs = model.forward_weighted(**inputs)
+
+ with torch.no_grad():
+ output_finetuned = self.finetuned_gpt2(**inputs)
+
+ outputs = model(**inputs, gpt2_model=gpt2_model)
+ # Save past state if it exists
+ if self.args.past_index >= 0:
+ self._past = outputs[self.args.past_index]
+
+ if self.matching_objective == 'kl':
+ # distrib_finetuned=torch.log_softmax(output_finetuned.logits[:,:,:-2], dim=-1) #bsz, seqlen, vocab
+ distrib_finetuned=torch.log_softmax(output_finetuned.logits, dim=-1) #bsz, seqlen, vocab
+ distrib_prefix = torch.log_softmax(outputs.logits, dim=-1) # bsz, seqlen, vocab
+ loss = torch.sum(distrib_finetuned.exp() * (distrib_finetuned - distrib_prefix), dim=-1) #bsz, seqlen
+
+ elif self.matching_objective == 'logits':
+ loss = torch.norm(output_finetuned.logits - outputs.logits, dim=-1) #bsz, seqlen
+ # loss = torch.norm(output_finetuned.logits[:,:,:-2] - outputs.logits, dim=-1) #bsz, seqlen
+
+ elif self.matching_objective == 'last_layer':
+ activation_diff = output_finetuned.last_hidden_state - outputs.last_hidden_state
+ loss = torch.norm(activation_diff, dim=-1) # bsz, seqlen
+ else:
+ assert False, "invalid matching_objective"
+
+ return loss.sum(dim=-1).mean()
+
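For the `'kl'` branch of `compute_loss_distill`, the per-position term is KL(p_finetuned || p_prefix) computed from log-probabilities, then summed over the sequence and averaged over the batch. A standalone numerical sketch with random logits (shapes are illustrative):

```python
import torch

bsz, seqlen, vocab = 2, 5, 11
finetuned_logits = torch.randn(bsz, seqlen, vocab)
prefix_logits = torch.randn(bsz, seqlen, vocab)

log_p_ft = torch.log_softmax(finetuned_logits, dim=-1)
log_p_prefix = torch.log_softmax(prefix_logits, dim=-1)

# KL(p_ft || p_prefix) per (batch, position), matching the 'kl' objective above.
kl = torch.sum(log_p_ft.exp() * (log_p_ft - log_p_prefix), dim=-1)  # shape (bsz, seqlen)
loss = kl.sum(dim=-1).mean()                                        # scalar, as returned by the method
print(loss)
```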
+ def is_local_master(self) -> bool:
+ """
+ Whether or not this process is the local (e.g., on one machine if training in a distributed fashion on
+ several machines) main process.
+
+ .. warning::
+
+ This method is deprecated, use :meth:`~transformers.Trainer.is_local_process_zero` instead.
+ """
+ warnings.warn("This method is deprecated, use `Trainer.is_local_process_zero()` instead.", FutureWarning)
+ return self.is_local_process_zero()
+
+ def is_local_process_zero(self) -> bool:
+ """
+ Whether or not this process is the local (e.g., on one machine if training in a distributed fashion on
+ several machines) main process.
+ """
+ if is_torch_tpu_available():
+ return xm.is_master_ordinal(local=True)
+ else:
+ return self.args.local_rank in [-1, 0]
+
+ def is_world_master(self) -> bool:
+ """
+ Whether or not this process is the global main process (when training in a distributed fashion on
+ several machines, this is only going to be :obj:`True` for one process).
+
+ .. warning::
+
+ This method is deprecated, use :meth:`~transformers.Trainer.is_world_process_zero` instead.
+ """
+ warnings.warn("This method is deprecated, use `Trainer.is_world_process_zero()` instead.", FutureWarning)
+ return self.is_world_process_zero()
+
+ def is_world_process_zero(self) -> bool:
+ """
+ Whether or not this process is the global main process (when training in a distributed fashion on
+ several machines, this is only going to be :obj:`True` for one process).
+ """
+ if is_torch_tpu_available():
+ return xm.is_master_ordinal(local=False)
+ else:
+ return self.args.local_rank == -1 or torch.distributed.get_rank() == 0
+
+ def save_model(self, output_dir: Optional[str] = None):
+ """
+ Will save the model, so you can reload it using :obj:`from_pretrained()`.
+
+ Will only save from the world_master process (unless in TPUs).
+ """
+
+ if is_torch_tpu_available():
+ self._save_tpu(output_dir)
+ elif self.is_world_process_zero():
+ self._save(output_dir)
+
+ def _save_tpu(self, output_dir: Optional[str] = None):
+ output_dir = output_dir if output_dir is not None else self.args.output_dir
+ logger.info("Saving model checkpoint to %s", output_dir)
+
+ if xm.is_master_ordinal():
+ os.makedirs(output_dir, exist_ok=True)
+ torch.save(self.args, os.path.join(output_dir, "training_args.bin"))
+ json.dump(
+ self.log_history, open(os.path.join(output_dir, "log_history.json"), "w"), indent=2, ensure_ascii=False
+ )
+
+ # Save a trained model and configuration using `save_pretrained()`.
+ # They can then be reloaded using `from_pretrained()`
+ if not isinstance(self.model, PreTrainedModel):
+ raise ValueError("Trainer.model appears to not be a PreTrainedModel")
+
+ xm.rendezvous("saving_checkpoint")
+ self.model.save_pretrained(output_dir)
+ if self.tokenizer is not None:
+ self.tokenizer.save_pretrained(output_dir)
+
+ def _save(self, output_dir: Optional[str] = None):
+ output_dir = output_dir if output_dir is not None else self.args.output_dir
+ os.makedirs(output_dir, exist_ok=True)
+ logger.info("Saving model checkpoint to %s", output_dir)
+ # Save a trained model and configuration using `save_pretrained()`.
+ # They can then be reloaded using `from_pretrained()`
+ if not isinstance(self.model, PreTrainedModel):
+ raise ValueError("Trainer.model appears to not be a PreTrainedModel")
+ self.model.save_pretrained(output_dir)
+ if self.tokenizer is not None:
+ self.tokenizer.save_pretrained(output_dir)
+
+ # Good practice: save your training arguments together with the trained model
+ torch.save(self.args, os.path.join(output_dir, "training_args.bin"))
+ json.dump(
+ self.log_history, open(os.path.join(output_dir, "log_history.json"), "w"), indent=2, ensure_ascii=False
+ )
+
+ def store_flos(self):
+ # Storing the number of floating-point operations that went into the model
+ if self.total_flos is not None:
+ if self.args.local_rank != -1:
+ total_flos = distributed_broadcast_scalars([self.total_flos]).sum().item()
+ else:
+ total_flos = self.total_flos
+ if total_flos > 0:
+ self.model.config.total_flos = total_flos
+
+ def _sorted_checkpoints(self, checkpoint_prefix=PREFIX_CHECKPOINT_DIR, use_mtime=False) -> List[str]:
+ output_dir_name = os.path.basename(self.args.output_dir)
+ checkpoint_prefix = f"{output_dir_name}-{PREFIX_CHECKPOINT_DIR}"
+
+ ordering_and_checkpoint_path = []
+
+ glob_checkpoints = [str(x) for x in Path(self.args.output_dir).glob(f"{checkpoint_prefix}-*")]
+
+ for path in glob_checkpoints:
+ if use_mtime:
+ ordering_and_checkpoint_path.append((os.path.getmtime(path), path))
+ else:
+ regex_match = re.match(f".*{checkpoint_prefix}-([0-9]+)", path)
+ if regex_match and regex_match.groups():
+ ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path))
+
+ checkpoints_sorted = sorted(ordering_and_checkpoint_path)
+ checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]
+ return checkpoints_sorted
+
+ def _rotate_checkpoints(self, use_mtime=False) -> None:
+ if self.args.save_total_limit is None or self.args.save_total_limit <= 0:
+ return
+
+ # Check if we should delete older checkpoint(s)
+ checkpoints_sorted = self._sorted_checkpoints(use_mtime=use_mtime)
+ if len(checkpoints_sorted) <= self.args.save_total_limit:
+ return
+
+ number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - self.args.save_total_limit)
+ checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete]
+ for checkpoint in checkpoints_to_be_deleted:
+ logger.info("Deleting older checkpoint [{}] due to args.save_total_limit".format(checkpoint))
+ shutil.rmtree(checkpoint)
+
+ def evaluate(self, eval_dataset: Optional[Dataset] = None) -> Dict[str, float]:
+ """
+ Run evaluation and returns metrics.
+
+ The calling script will be responsible for providing a method to compute metrics, as they are
+ task-dependent (pass it to the init :obj:`compute_metrics` argument).
+
+ You can also subclass and override this method to inject custom behavior.
+
+ Args:
+ eval_dataset (:obj:`Dataset`, `optional`):
+ Pass a dataset if you wish to override :obj:`self.eval_dataset`. If it is an :obj:`datasets.Dataset`,
+ columns not accepted by the ``model.forward()`` method are automatically removed.
+
+ Returns:
+ A dictionary containing the evaluation loss and the potential metrics computed from the predictions.
+ """
+ eval_dataloader = self.get_eval_dataloader(eval_dataset)
+
+ output = self.prediction_loop(eval_dataloader, description="Evaluation")
+
+ self.log(output.metrics)
+
+ if self.args.tpu_metrics_debug or self.args.debug:
+ # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)
+ xm.master_print(met.metrics_report())
+
+ return output.metrics
+
+
+
+ def predict(self, test_dataset: Dataset) -> PredictionOutput:
+ """
+ Run prediction and returns predictions and potential metrics.
+
+ Depending on the dataset and your use case, your test dataset may contain labels.
+ In that case, this method will also return metrics, like in :obj:`evaluate()`.
+
+ Args:
+ test_dataset (:obj:`Dataset`):
+ Dataset to run the predictions on. If it is an :obj:`datasets.Dataset`, columns not accepted by the
+ ``model.forward()`` method are automatically removed.
+
+ Returns:
+ `NamedTuple`:
+ predictions (:obj:`np.ndarray`):
+ The predictions on :obj:`test_dataset`.
+ label_ids (:obj:`np.ndarray`, `optional`):
+ The labels (if the dataset contained some).
+ metrics (:obj:`Dict[str, float]`, `optional`):
+ The potential dictionary of metrics (if the dataset contained labels).
+ """
+ test_dataloader = self.get_test_dataloader(test_dataset)
+
+ return self.prediction_loop(test_dataloader, description="Prediction")
+
+ def prediction_loop(
+ self, dataloader: DataLoader, description: str, prediction_loss_only: Optional[bool] = None
+ ) -> PredictionOutput:
+ """
+ Prediction/evaluation loop, shared by :obj:`Trainer.evaluate()` and :obj:`Trainer.predict()`.
+
+ Works both with or without labels.
+ """
+ if hasattr(self, "_prediction_loop"):
+ warnings.warn(
+ "The `_prediction_loop` method is deprecated and won't be called in a future version, define `prediction_loop` in your subclass.",
+ FutureWarning,
+ )
+ return self._prediction_loop(dataloader, description, prediction_loss_only=prediction_loss_only)
+
+ prediction_loss_only = (
+ prediction_loss_only if prediction_loss_only is not None else self.args.prediction_loss_only
+ )
+
+ assert not getattr(
+ self.model.config, "output_attentions", False
+ ), "The prediction loop does not work with `output_attentions=True`."
+ assert not getattr(
+ self.model.config, "output_hidden_states", False
+ ), "The prediction loop does not work with `output_hidden_states=True`."
+
+ model = self.model
+ # multi-gpu eval
+ if self.args.n_gpu > 1:
+ model = torch.nn.DataParallel(model)
+ else:
+ model = self.model
+ # Note: in torch.distributed mode, there's no point in wrapping the model
+ # inside a DistributedDataParallel as we'll be under `no_grad` anyways.
+
+ batch_size = dataloader.batch_size
+ logger.info("***** Running %s *****", description)
+ logger.info(" Num examples = %d", self.num_examples(dataloader))
+ logger.info(" Batch size = %d", batch_size)
+ eval_losses: List[float] = []
+ preds: Optional[torch.Tensor] = None
+ label_ids: Optional[torch.Tensor] = None
+ entropy_losses: List[float] = []
+ model.eval()
+ if self.gpt2 is not None:
+ self.gpt2.eval()
+
+ # Sanity check: both should print False now that eval mode is set.
+ print(model.training)
+ if self.gpt2 is not None:
+ print(self.gpt2.training)
+
+ if is_torch_tpu_available():
+ dataloader = pl.ParallelLoader(dataloader, [self.args.device]).per_device_loader(self.args.device)
+
+ if self.args.past_index >= 0:
+ self._past = None
+
+ disable_tqdm = not self.is_local_process_zero() or self.args.disable_tqdm
+ for inputs in tqdm(dataloader, desc=description, disable=disable_tqdm):
+ loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only)
+ batch_size = inputs[list(inputs.keys())[0]].shape[0]
+ if loss is not None:
+ eval_losses.extend([loss] * batch_size)
+ if logits is not None:
+ preds = logits if preds is None else nested_concat(preds, logits, dim=0)
+ temp_logits = [torch.log_softmax(x, dim=-1) for x in logits]
+ entropy_losses.extend([(x.exp() * x).sum().item() for x in temp_logits])
+ if labels is not None:
+ label_ids = labels if label_ids is None else nested_concat(label_ids, labels, dim=0)
+
+ if self.args.past_index and hasattr(self, "_past"):
+ # Clean the state at the end of the evaluation loop
+ delattr(self, "_past")
+
+
+
+ if self.compute_metrics is not None and preds is not None and label_ids is not None:
+ metrics = self.compute_metrics(EvalPrediction(predictions=preds, label_ids=label_ids))
+ else:
+ metrics = {}
+
+ # Prefix all keys with eval_
+ for key in list(metrics.keys()):
+ if not key.startswith("eval_"):
+ metrics[f"eval_{key}"] = metrics.pop(key)
+ if len(entropy_losses) > 0:
+ metrics['entropy'] = np.mean(entropy_losses)
+ print('entropy', metrics['entropy'] )
+
+ return PredictionOutput(predictions=preds, label_ids=label_ids, metrics=metrics)
+
+ def prediction_step(
+ self, model: nn.Module, inputs: Dict[str, Union[torch.Tensor, Any]], prediction_loss_only: bool
+ ) -> Tuple[Optional[float], Optional[torch.Tensor], Optional[torch.Tensor]]:
+ """
+ Perform an evaluation step on :obj:`model` using obj:`inputs`.
+
+ Subclass and override to inject custom behavior.
+
+ Args:
+ model (:obj:`nn.Module`):
+ The model to evaluate.
+ inputs (:obj:`Dict[str, Union[torch.Tensor, Any]]`):
+ The inputs and targets of the model.
+
+ The dictionary will be unpacked before being fed to the model. Most models expect the targets under the
+ argument :obj:`labels`. Check your model's documentation for all accepted arguments.
+ prediction_loss_only (:obj:`bool`):
+ Whether or not to return the loss only.
+
+ Return:
+ Tuple[Optional[float], Optional[torch.Tensor], Optional[torch.Tensor]]:
+ A tuple with the loss, logits and labels (each being optional).
+ """
+ has_labels = all(inputs.get(k) is not None for k in self.args.label_names)
+ inputs = self._prepare_inputs(inputs)
+
+ # At eval time, set the weights to 1/bsz. and see the results..
+
+ # if 'weights' in inputs:
+ # weights = inputs['weights']
+ # bsz = weights.view(-1).shape[0]
+ # weights = (torch.ones(weights.shape)/bsz).to(weights.device)
+ # inputs['weights'] = weights
+
+ with torch.no_grad():
+ # outputs = model.forward_weighted(**inputs)
+ outputs = model(**inputs, gpt2_model=self.gpt2)
+ if has_labels:
+ # The .mean() is to reduce in case of distributed training
+ loss = outputs[0].mean().item()
+ logits = outputs[1:]
+ else:
+ loss = None
+ # Slicing so we get a tuple even if `outputs` is a `ModelOutput`.
+ logits = outputs[:]
+ if self.args.past_index >= 0:
+ self._past = outputs[self.args.past_index if has_labels else self.args.past_index - 1]
+
+ if prediction_loss_only:
+ return (loss, None, None)
+
+ logits = tuple(logit.detach() for logit in logits)
+ if len(logits) == 1:
+ logits = logits[0]
+
+ if has_labels:
+ labels = tuple(inputs.get(name).detach() for name in self.args.label_names)
+ if len(labels) == 1:
+ labels = labels[0]
+ else:
+ labels = None
+
+ return (loss, logits, labels)
+
+ def floating_point_ops(self, inputs: Dict[str, Union[torch.Tensor, Any]]):
+ """
+ For models that inherit from :class:`~transformers.PreTrainedModel`, uses the model's
+ :obj:`floating_point_ops` method to compute the number of floating point operations for every backward + forward pass. If using
+ another model, either implement such a method in the model or subclass and override this method.
+
+ Args:
+ inputs (:obj:`Dict[str, Union[torch.Tensor, Any]]`):
+ The inputs and targets of the model.
+
+ Returns:
+ :obj:`int`: The number of floating-point operations.
+ """
+
+ if isinstance(self.model, torch.nn.DataParallel) or isinstance(
+ self.model, torch.nn.parallel.DistributedDataParallel
+ ):
+ model = self.model.module
+ else:
+ model = self.model
+
+ if hasattr(model, "floating_point_ops"):
+ return model.floating_point_ops(inputs)
+
+ else:
+ return 0
\ No newline at end of file
diff --git a/dalle/utils/__init__.py b/dalle/utils/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..776dd3a6ef93a2d905cbcaec159b6db320bdf3db
--- /dev/null
+++ b/dalle/utils/__init__.py
@@ -0,0 +1,3 @@
+from .utils import *
+from .config import *
+from .sampling import *
\ No newline at end of file
diff --git a/dalle/utils/__pycache__/__init__.cpython-38.pyc b/dalle/utils/__pycache__/__init__.cpython-38.pyc
new file mode 100644
index 0000000000000000000000000000000000000000..3d5329913307482e9cf9e4c122fc97034e19c2c2
Binary files /dev/null and b/dalle/utils/__pycache__/__init__.cpython-38.pyc differ
diff --git a/dalle/utils/__pycache__/config.cpython-38.pyc b/dalle/utils/__pycache__/config.cpython-38.pyc
new file mode 100644
index 0000000000000000000000000000000000000000..ee89be6c3dc9d70bee9d56dd50983a4a5e3d316d
Binary files /dev/null and b/dalle/utils/__pycache__/config.cpython-38.pyc differ
diff --git a/dalle/utils/__pycache__/sampling.cpython-38.pyc b/dalle/utils/__pycache__/sampling.cpython-38.pyc
new file mode 100644
index 0000000000000000000000000000000000000000..0a6cd8cf58094adfee641629754bffcf568cd107
Binary files /dev/null and b/dalle/utils/__pycache__/sampling.cpython-38.pyc differ
diff --git a/dalle/utils/__pycache__/utils.cpython-38.pyc b/dalle/utils/__pycache__/utils.cpython-38.pyc
new file mode 100644
index 0000000000000000000000000000000000000000..6837fcbd7fc891ddb589e9d27e20a0769184dfc5
Binary files /dev/null and b/dalle/utils/__pycache__/utils.cpython-38.pyc differ
diff --git a/dalle/utils/config.py b/dalle/utils/config.py
new file mode 100644
index 0000000000000000000000000000000000000000..9dfd95eda19d4c852b1c9a1865919f6b6f140482
--- /dev/null
+++ b/dalle/utils/config.py
@@ -0,0 +1,209 @@
+# ------------------------------------------------------------------------------------
+# Minimal DALL-E
+# Copyright (c) 2021 KakaoBrain. All Rights Reserved.
+# Licensed under the Apache License, Version 2.0 [see LICENSE for details]
+# ------------------------------------------------------------------------------------
+
+from typing import Optional, List
+from dataclasses import dataclass, field
+from omegaconf import OmegaConf
+
+
+@dataclass
+class DataConfig:
+ dataset: Optional[str] = None
+ tokenizer_type: str = 'CharBPE'
+ context_length: int = 64
+ image_resolution: int = 256
+ transforms: str = 'dalle-vqvae'
+ bpe_pdrop: Optional[float] = None
+
+
+@dataclass
+class Stage1Hparams:
+ double_z: bool = False
+ z_channels: int = 256
+ resolution: int = 256
+ in_channels: int = 3
+ out_ch: int = 3
+ ch: int = 128
+ ch_mult: List[int] = field(default_factory=lambda: [1, 1, 2, 2, 4])
+ num_res_blocks: int = 2
+ attn_resolutions: List[int] = field(default_factory=lambda: [16])
+ pdrop: float = 0.0
+
+
+@dataclass
+class Stage2Hparams:
+ embed_dim: int = 1536
+ n_layers: int = 42
+ n_heads: int = 24
+ n_dense_layers: int = 42
+ ctx_len_img: int = 256
+ ctx_len_txt: int = 64
+ embd_pdrop: float = 0.0
+ resid_pdrop: float = 0.0
+ attn_pdrop: float = 0.0
+ mlp_bias: bool = True
+ attn_bias: bool = True
+ gelu_use_approx: bool = False
+ use_head_txt: bool = True
+ n_classes: Optional[int] = None
+
+
+@dataclass
+class Stage1Config:
+ type: str = 'vqgan'
+ embed_dim: int = 256
+ n_embed: int = 16384
+ hparams: Stage1Hparams = Stage1Hparams()
+
+
+@dataclass
+class Stage2Config:
+ type: str = 'transformer1d'
+ vocab_size_txt: int = 16384
+ vocab_size_img: int = 16384
+ use_cls_cond: Optional[bool] = None
+ hparams: Stage2Hparams = Stage2Hparams()
+
+
+@dataclass
+class WarmupConfig:
+ epoch: int = 1
+ multiplier: int = 1
+ buffer_epoch: int = 0
+ min_lr: float = 0.0
+ mode: str = 'fix'
+ peak_lr: float = 1e-4
+ start_from_zero: bool = True
+
+
+@dataclass
+class OptConfig:
+ opt_type: str = 'adamW'
+ learning_rate: float = 5e-5
+ weight_decay: float = 1e-4
+ betas: List[float] = field(default_factory=lambda: [0.9, 0.99])
+ grad_clip_norm: float = 1.0
+
+ sched_type: str = 'cosine'
+ max_steps: int = 0
+ min_lr: float = 1e-6
+
+
+@dataclass
+class ExpConfig:
+ per_gpu_train_batch_size: int = 4
+ per_gpu_eval_batch_size: int = 32
+ num_train_epochs: int = 10
+ save_ckpt_freq: int = 1
+ test_freq: int = 10
+ use_amp: bool = True
+
+
+@dataclass
+class PrefixModelConfig:
+ model_name_or_path: Optional[str] = ''
+ prefix_model_name_or_path: str = ''
+ prefix_mode: str = 'activation'
+ tuning_mode: str = 'finetune'
+ top_k_layers: int = 2
+ parameterize_mode: str = 'mlp'
+ optim_prefix: bool = False
+ preseqlen: int = 10
+ prefix_dropout: float = 0.1
+ init_random: bool = False
+ hidden_dim_prefix: int = 512
+ lowdata: bool = False
+ lowdata_token: str = ''
+ init_shallow: bool = False
+ init_shallow_word: bool = False
+ teacher_dropout: float = 0.1
+ gumbel: bool = False
+ replay_buffer: bool = False
+
+
+@dataclass
+class PromptModelConfig:
+ model_name_or_path: Optional[str] = ''
+ prefix_model_name_or_path: str = ''
+ tuning_mode: str = 'prompt'
+ preseqlen: int = 10
+ prefix_dropout: float = 0.1
+
+
+@dataclass
+class StoryModelConfig:
+ model_name_or_path: Optional[str] = ''
+ prefix_model_name_or_path: str = ''
+ tuning_mode: str = 'story'
+ preseqlen: int = 10
+ prefix_dropout: float = 0.1
+ prompt: bool = False
+ story_len: int = 4
+ sent_embed: int = 256
+ condition: bool = False
+ clip_embed: bool = False
+
+
+@dataclass
+class DefaultConfig:
+ dataset: DataConfig = DataConfig()
+ stage1: Stage1Config = Stage1Config()
+ stage2: Stage2Config = Stage2Config()
+
+
+@dataclass
+class FineTuningConfig:
+ dataset: DataConfig = DataConfig()
+ stage1: Stage1Config = Stage1Config()
+ stage2: Stage2Config = Stage2Config()
+ optimizer: OptConfig = OptConfig()
+ experiment: ExpConfig = ExpConfig()
+
+
+@dataclass
+class PrefixTuningConfig:
+ dataset: DataConfig = DataConfig()
+ stage1: Stage1Config = Stage1Config()
+ stage2: Stage2Config = Stage2Config()
+ prefix: PrefixModelConfig = PrefixModelConfig()
+ optimizer: OptConfig = OptConfig()
+ experiment: ExpConfig = ExpConfig()
+
+
+@dataclass
+class PromptTuningConfig:
+ dataset: DataConfig = DataConfig()
+ stage1: Stage1Config = Stage1Config()
+ stage2: Stage2Config = Stage2Config()
+ prompt: PromptModelConfig = PromptModelConfig()
+ optimizer: OptConfig = OptConfig()
+ experiment: ExpConfig = ExpConfig()
+
+
+@dataclass
+class StoryConfig:
+ dataset: DataConfig = DataConfig()
+ stage1: Stage1Config = Stage1Config()
+ stage2: Stage2Config = Stage2Config()
+ story: StoryModelConfig = StoryModelConfig()
+ optimizer: OptConfig = OptConfig()
+ experiment: ExpConfig = ExpConfig()
+
+
+def get_base_config(mode):
+ if mode == 'default':
+ return OmegaConf.structured(DefaultConfig)
+ elif mode == 'finetuning':
+ return OmegaConf.structured(FineTuningConfig)
+ elif mode == 'prefixtuning':
+ return OmegaConf.structured(PrefixTuningConfig)
+ elif mode == 'prompt_tuning':
+ return OmegaConf.structured(PromptTuningConfig)
+ elif mode == 'story':
+ return OmegaConf.structured(StoryConfig)
+ else:
+ raise ValueError('unknown config mode: %s' % mode)
+ # return OmegaConf.structured(DefaultConfig if use_default else FineTuningConfig)
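+
+
+# Minimal usage sketch (illustrative only; not part of the released training code):
+# build the structured config for the story-continuation setting and override a few
+# fields from an OmegaConf dotlist.
+if __name__ == '__main__':
+ cfg = get_base_config('story')
+ cfg = OmegaConf.merge(cfg, OmegaConf.from_dotlist(['story.story_len=4', 'optimizer.learning_rate=1e-4']))
+ print(OmegaConf.to_yaml(cfg))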
diff --git a/dalle/utils/sampling.py b/dalle/utils/sampling.py
new file mode 100644
index 0000000000000000000000000000000000000000..26d544d960e33d3a7f0de63dd98fc1df1a521b6b
--- /dev/null
+++ b/dalle/utils/sampling.py
@@ -0,0 +1,369 @@
+# ------------------------------------------------------------------------------------
+# Minimal DALL-E
+# Copyright (c) 2021 KakaoBrain. All Rights Reserved.
+# Licensed under the Apache License, Version 2.0 [see LICENSE for details]
+# ------------------------------------------------------------------------------------
+
+import torch
+from typing import Optional
+from tqdm import tqdm
+from torch.nn import functional as F
+
+
+torch.set_printoptions(precision=2, threshold=10)
+
+
+def cutoff_topk_logits(logits: torch.FloatTensor, k: int) -> torch.FloatTensor:
+ if k is None:
+ return logits
+ else:
+ v, ix = torch.topk(logits, k)
+ out = logits.clone()
+ out[out < v[:, [-1]]] = -float('Inf')
+ return out
+
+
+def cutoff_topp_probs(probs: torch.FloatTensor, p: float) -> torch.FloatTensor:
+ if p is None:
+ return probs
+ else:
+ sorted_probs, sorted_indices = torch.sort(probs, dim=-1, descending=True)
+ cum_probs = torch.cumsum(sorted_probs, dim=-1)
+
+ sorted_idx_remove_cond = cum_probs >= p
+
+ sorted_idx_remove_cond[..., 1:] = sorted_idx_remove_cond[..., :-1].clone()
+ sorted_idx_remove_cond[..., 0] = 0
+
+ indices_to_remove = sorted_idx_remove_cond.scatter(-1, sorted_indices, sorted_idx_remove_cond)
+ probs = probs.masked_fill(indices_to_remove, 0.0)
+ norm_probs = probs / torch.sum(probs, dim=-1, keepdim=True)
+ return norm_probs
+
+
+def get_positional_encoding(inputs: torch.LongTensor, mode: str = '1d') -> torch.LongTensor:
+ device = inputs.device
+ if mode == '1d':
+ B, N = inputs.shape
+ xs_pos = torch.arange(N, device=device).repeat((B, 1))
+ elif mode == '2d':
+ B, H, W = inputs.shape
+ xs_pos_h = torch.arange(H, device=device).repeat(B, W, 1).transpose(1, 2)
+ xs_pos_w = torch.arange(W, device=device).repeat(B, H, 1)
+ xs_pos = (xs_pos_h, xs_pos_w)
+ else:
+ raise ValueError('invalid positional encoding mode: %s' % mode)
+ return xs_pos
+
+
+@torch.no_grad()
+def sampling(model: torch.nn.Module,
+ tokens: torch.LongTensor,
+ top_k: Optional[float] = None,
+ top_p: Optional[float] = None,
+ softmax_temperature: float = 1.0,
+ is_tqdm: bool = True,
+ use_fp16: bool = True,
+ max_seq_len: int = 256,
+ prompt: Optional[torch.Tensor] = None,
+ pos_prompt: Optional[torch.Tensor] = None) -> torch.LongTensor:
+
+ code = None
+ past = None
+
+ pbar = tqdm(range(max_seq_len), total=max_seq_len) if is_tqdm else range(max_seq_len)
+ pos_enc_tokens = get_positional_encoding(tokens, mode='1d')
+
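+ # Incremental decoding: after the first step only the most recently generated image
+ # token (and its position) is fed to the model; earlier context is carried in the
+ # key/value states accumulated in `past`.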
+ for cnt, h in enumerate(pbar):
+ if code is None:
+ code_ = None
+ pos_enc_code_ = None
+ else:
+ code_ = code.clone().detach()
+ pos_enc_code_ = get_positional_encoding(code_, mode='1d')
+ code_ = code_[:, cnt-1].unsqueeze(-1)
+ pos_enc_code_ = pos_enc_code_[:, cnt-1].unsqueeze(-1)
+
+ logits, present = model.sampling(images=code_,
+ texts=tokens,
+ pos_images=pos_enc_code_,
+ pos_texts=pos_enc_tokens,
+ use_fp16=use_fp16,
+ past=past,
+ prompt=prompt,
+ pos_prompt=pos_prompt)
+
+ logits = logits.to(dtype=torch.float32)
+ logits = logits / softmax_temperature
+
+ # print(len(present), present[0].shape)
+ present = torch.stack(present).clone().detach()
+ if past is None:
+ past = [present]
+ else:
+ past.append(present)
+
+ logits = cutoff_topk_logits(logits, top_k)
+ probs = F.softmax(logits, dim=-1)
+ probs = cutoff_topp_probs(probs, top_p)
+ # print(probs[0])
+
+ idx = torch.multinomial(probs, num_samples=1).clone().detach()
+ # print(idx)
+ code = idx if code is None else torch.cat([code, idx], axis=1)
+
+ del past
+ return code
+
+
+@torch.no_grad()
+def sampling_prefix(model: torch.nn.Module,
+ tokens: torch.LongTensor,
+ past: torch.FloatTensor,
+ top_k: Optional[float] = None,
+ top_p: Optional[float] = None,
+ softmax_temperature: float = 1.0,
+ is_tqdm: bool = True,
+ use_fp16: bool = True,
+ max_seq_len: int = 256,
+ labels = None) -> torch.LongTensor:
+ code = None
+
+ pbar = tqdm(range(max_seq_len), total=max_seq_len) if is_tqdm else range(max_seq_len)
+ pos_enc_tokens = get_positional_encoding(tokens, mode='1d')
+
+ # print("Entering sampling_prefix; ", past.shape)
+ if past is not None:
+ past = [past]
+
+ for cnt, h in enumerate(pbar):
+ if code is None:
+ code_ = None
+ pos_enc_code_ = None
+ else:
+ code_ = code.clone().detach()
+ pos_enc_code_ = get_positional_encoding(code_, mode='1d')
+ code_ = code_[:, cnt-1].unsqueeze(-1)
+ pos_enc_code_ = pos_enc_code_[:, cnt-1].unsqueeze(-1)
+
+ # print("Looop enter")
+ # print(cnt, past[0].shape)
+ # print("-------------------")
+ logits, present = model.sampling(images=code_,
+ texts=tokens,
+ pos_images=pos_enc_code_,
+ pos_texts=pos_enc_tokens,
+ use_fp16=use_fp16,
+ past=past)
+ logits = logits.to(dtype=torch.float32)
+ logits = logits / softmax_temperature
+
+ present = torch.stack(present).clone().detach()
+
+ # print('Present', present.shape)
+
+ if past is None:
+ past = [present]
+ else:
+ # print("Loop end")
+ # print(present.shape)
+ # print("-----------------")
+
+ # n_layers, temp, _, seq_len, n_dim = present.shape
+ # _, _, bs, n_heads, pre_seq_len, n_dim = past[0].shape
+ # assert temp == 2
+ # past.append(present.view(n_layers, temp, bs, n_heads, seq_len, n_dim))
+
+ past.append(present)
+
+ logits = cutoff_topk_logits(logits, top_k)
+ probs = F.softmax(logits, dim=-1)
+ probs = cutoff_topp_probs(probs, top_p)
+ # Debugging output (disabled): inspect the top-5 candidate tokens and, when provided,
+ # the ground-truth token at this step.
+ # print(torch.topk(probs, 5, dim=-1))
+ # if labels is not None:
+ #     print(labels[cnt])
+ idx = torch.multinomial(probs, num_samples=1).clone().detach()
+ # print(idx)
+ code = idx if code is None else torch.cat([code, idx], axis=1)
+
+ del past
+ return code
+
+
+@torch.no_grad()
+def sampling_prefix_new(model: torch.nn.Module,
+ tokens: torch.LongTensor,
+ past: torch.FloatTensor,
+ top_k: Optional[float] = None,
+ top_p: Optional[float] = None,
+ softmax_temperature: float = 1.0,
+ is_tqdm: bool = True,
+ use_fp16: bool = True,
+ max_seq_len: int = 256) -> torch.LongTensor:
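+ # NOTE: unfinished experimental variant; only the first decoding step (cnt == 0) is
+ # handled, so at most one image token is sampled before the loop becomes a no-op.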
+ code = None
+
+ pbar = tqdm(range(max_seq_len), total=max_seq_len) if is_tqdm else range(max_seq_len)
+ pos_enc_tokens = get_positional_encoding(tokens, mode='1d')
+
+ # print("Entering sampling_prefix; ", past.shape)
+ if past is not None:
+ past = [past]
+
+ for cnt, h in enumerate(pbar):
+ if code is None:
+ code_ = None
+ pos_enc_code_ = None
+ else:
+ code_ = code.clone().detach()
+ pos_enc_code_ = get_positional_encoding(code_, mode='1d')
+ # code_ = code_[:, cnt-1].unsqueeze(-1)
+ # pos_enc_code_ = pos_enc_code_[:, cnt-1].unsqueeze(-1)
+
+ # print("Looop enter")
+ # print(cnt, past[0].shape)
+ # print("-------------------")
+
+ if cnt == 0:
+ logits, present = model.sampling(images=code_,
+ texts=tokens,
+ pos_images=pos_enc_code_,
+ pos_texts=pos_enc_tokens,
+ use_fp16=use_fp16,
+ past=past)
+ logits = logits.to(dtype=torch.float32)
+ logits = logits / softmax_temperature
+
+ present = torch.stack(present).clone().detach()
+
+ # print('Present', present.shape)
+
+ if past is None:
+ past = [present]
+ else:
+ pass
+
+ logits = cutoff_topk_logits(logits, top_k)
+ probs = F.softmax(logits, dim=-1)
+ probs = cutoff_topp_probs(probs, top_p)
+ # print(torch.topk(probs[0], 5))
+ idx = torch.multinomial(probs, num_samples=1).clone().detach()
+ # print(idx)
+ code = idx if code is None else torch.cat([code, idx], axis=1)
+
+ else:
+ pass
+
+
+ del past
+ return code
+
+@torch.no_grad()
+def sampling_conditional(model: torch.nn.Module,
+ cross_attention_idxs,
+ cross_attention_layers,
+ tokens: torch.LongTensor,
+ src_codes: torch.FloatTensor,
+ top_k: Optional[float] = None,
+ top_p: Optional[float] = None,
+ softmax_temperature: float = 1.0,
+ is_tqdm: bool = True,
+ use_fp16: bool = True,
+ max_seq_len: int = 256,
+ prompt: Optional[torch.Tensor] = None,
+ pos_prompt: Optional[torch.Tensor] = None) -> torch.LongTensor:
+
+ code = None
+ past = None
+
+ pbar = tqdm(range(max_seq_len), total=max_seq_len) if is_tqdm else range(max_seq_len)
+ pos_enc_tokens = get_positional_encoding(tokens, mode='1d')
+
+ src_pos_tokens = get_positional_encoding(src_codes, mode='1d')
+ src_tokens = model.tok_emb_img(src_codes)
+ src_tokens = src_tokens + model.pos_emb_img(src_pos_tokens)
+
+ for cnt, h in enumerate(pbar):
+ if code is None:
+ code_ = None
+ pos_enc_code_ = None
+ else:
+ code_ = code.clone().detach()
+ pos_enc_code_ = get_positional_encoding(code_, mode='1d')
+ code_ = code_[:, cnt-1].unsqueeze(-1)
+ pos_enc_code_ = pos_enc_code_[:, cnt-1].unsqueeze(-1)
+
+ logits, present = model.sampling_with_context(images=code_,
+ cross_attention_idxs=cross_attention_idxs,
+ cross_attention_layers=cross_attention_layers,
+ texts=tokens,
+ pos_images=pos_enc_code_,
+ pos_texts=pos_enc_tokens,
+ source_image=src_tokens,
+ use_fp16=use_fp16,
+ past=past,
+ prompt=prompt,
+ pos_prompt=pos_prompt)
+ logits = logits.to(dtype=torch.float32)
+ logits = logits / softmax_temperature
+
+ present = torch.stack(present).clone().detach()
+ if past is None:
+ past = [present]
+ else:
+ past.append(present)
+
+ logits = cutoff_topk_logits(logits, top_k)
+ probs = F.softmax(logits, dim=-1)
+ probs = cutoff_topp_probs(probs, top_p)
+
+ idx = torch.multinomial(probs, num_samples=1).clone().detach()
+ code = idx if code is None else torch.cat([code, idx], axis=1)
+
+ del past
+ return code
+
+
+@torch.no_grad()
+def sampling_igpt(model: torch.nn.Module,
+ sos: torch.FloatTensor,
+ top_k: Optional[float] = None,
+ top_p: Optional[float] = None,
+ softmax_temperature: float = 1.0,
+ is_tqdm: bool = True,
+ use_fp16: bool = True,
+ max_seq_len: int = 256) -> torch.LongTensor:
+ code = None
+ past = None
+ pbar = tqdm(range(max_seq_len), total=max_seq_len) if is_tqdm else range(max_seq_len)
+
+ for cnt, h in enumerate(pbar):
+ if code is None:
+ code_ = None
+ pos_enc_code_ = None
+ else:
+ code_ = code.clone().detach()
+ pos_enc_code_ = get_positional_encoding(code_, mode='1d')
+ code_ = code_[:, cnt-1].unsqueeze(-1)
+ pos_enc_code_ = pos_enc_code_[:, cnt-1].unsqueeze(-1)
+
+ logits, present = model.sampling(sos=sos,
+ codes=code_,
+ pos_codes=pos_enc_code_,
+ use_fp16=use_fp16,
+ past=past)
+ logits = logits.to(dtype=torch.float32)
+ logits = logits / softmax_temperature
+
+ present = torch.stack(present).clone().detach()
+ if past is None:
+ past = [present]
+ else:
+ past.append(present)
+
+ logits = cutoff_topk_logits(logits, top_k)
+ probs = F.softmax(logits, dim=-1)
+ probs = cutoff_topp_probs(probs, top_p)
+
+ idx = torch.multinomial(probs, num_samples=1).clone().detach()
+ code = idx if code is None else torch.cat([code, idx], axis=1)
+
+ del past
+ return code
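+
+
+# Quick sanity check of the top-k / top-p filters above (illustrative only): with k=2
+# only the two largest logits survive, and with p=0.9 tokens are zeroed once the
+# cumulative probability of the sorted distribution has already reached 0.9 (the token
+# that crosses the threshold is kept).
+if __name__ == '__main__':
+ toy_logits = torch.tensor([[2.0, 1.0, 0.1, -1.0]])
+ toy_probs = F.softmax(cutoff_topk_logits(toy_logits, 2), dim=-1)
+ print(cutoff_topp_probs(toy_probs, 0.9))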
diff --git a/dalle/utils/utils.py b/dalle/utils/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..7f0c4f5320a918330b63d0fac554e8c80050aaf9
--- /dev/null
+++ b/dalle/utils/utils.py
@@ -0,0 +1,131 @@
+# ------------------------------------------------------------------------------------
+# Minimal DALL-E
+# Copyright (c) 2021 KakaoBrain. All Rights Reserved.
+# Licensed under the Apache License, Version 2.0 [see LICENSE for details]
+# ------------------------------------------------------------------------------------
+
+import os
+import random
+import urllib
+import hashlib
+import tarfile
+import torch
+import clip
+import numpy as np
+from PIL import Image
+from torch.nn import functional as F
+from tqdm import tqdm
+import torchvision.utils as vutils
+import matplotlib.pyplot as plt
+
+
+def set_seed(seed: int):
+ random.seed(seed)
+ np.random.seed(seed)
+ torch.manual_seed(seed)
+ torch.cuda.manual_seed_all(seed)
+
+
+@torch.no_grad()
+def clip_score(prompt: str,
+ images: np.ndarray,
+ model_clip: torch.nn.Module,
+ preprocess_clip,
+ device: str) -> np.ndarray:
+ images = [preprocess_clip(Image.fromarray((image*255).astype(np.uint8))) for image in images]
+ images = torch.stack(images, dim=0).to(device=device)
+ texts = clip.tokenize(prompt).to(device=device)
+ texts = torch.repeat_interleave(texts, images.shape[0], dim=0)
+
+ image_features = model_clip.encode_image(images)
+ text_features = model_clip.encode_text(texts)
+
+ scores = F.cosine_similarity(image_features, text_features).squeeze()
+ rank = torch.argsort(scores, descending=True).cpu().numpy()
+ return rank
+
+
+def download(url: str, root: str) -> str:
+ os.makedirs(root, exist_ok=True)
+ filename = os.path.basename(url)
+ pathname = filename[:-len('.tar.gz')]
+
+ expected_md5 = url.split("/")[-2]
+ download_target = os.path.join(root, filename)
+ result_path = os.path.join(root, pathname)
+
+ if os.path.isfile(download_target) and (os.path.exists(result_path) and not os.path.isfile(result_path)):
+ return result_path
+
+ with urllib.request.urlopen(url) as source, open(download_target, 'wb') as output:
+ with tqdm(total=int(source.info().get('Content-Length')), ncols=80, unit='iB', unit_scale=True,
+ unit_divisor=1024) as loop:
+ while True:
+ buffer = source.read(8192)
+ if not buffer:
+ break
+
+ output.write(buffer)
+ loop.update(len(buffer))
+
+ if hashlib.md5(open(download_target, 'rb').read()).hexdigest() != expected_md5:
+ raise RuntimeError('Model has been downloaded but the md5 checksum does not match')
+
+ with tarfile.open(download_target, 'r:gz') as f:
+ pbar = tqdm(f.getmembers(), total=len(f.getmembers()))
+ for member in pbar:
+ pbar.set_description(f'extracting: {member.name} (size:{member.size // (1024 * 1024)}MB)')
+ f.extract(member=member, path=root)
+
+ return result_path
+
+
+def realpath_url_or_path(url_or_path: str, root: str = None) -> str:
+ if urllib.parse.urlparse(url_or_path).scheme in ('http', 'https'):
+ return download(url_or_path, root)
+ return url_or_path
+
+
+def images_to_numpy(tensor):
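+ # Clamps a CHW tensor to [-1, 1] and maps it to an HWC uint8 array in [0, 255].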
+ generated = tensor.data.cpu().numpy().transpose(1,2,0)
+ generated[generated < -1] = -1
+ generated[generated > 1] = 1
+ generated = (generated + 1) / 2 * 255
+ return generated.astype('uint8')
+
+
+def save_image(ground_truth, images, out_dir, batch_idx):
+
+ for i, im in enumerate(images):
+ if len(im.shape) == 3:
+ plt.imsave(os.path.join(out_dir, 'test_%s_%s.png' % (batch_idx, i)), im)
+ else:
+ bs = im.shape[0]
+ # plt.imsave()
+ for j in range(bs):
+ plt.imsave(os.path.join(out_dir, 'test_%s_%s_%s.png' % (batch_idx, i, j)), im[j])
+
+
+ # print("Ground truth Images shape: ", ground_truth.shape, len(images))
+
+ # images = vutils.make_grid(images, nrow=ground_truth.shape[0])
+ # images = images_to_numpy(images)
+ #
+ # if ground_truth is not None:
+ # ground_truth = vutils.make_grid(ground_truth, 5)
+ # ground_truth = images_to_numpy(ground_truth)
+ # print("Ground Truth shape, Generated Images shape: ", ground_truth.shape, images.shape)
+ # images = np.concatenate([ground_truth, images], axis=0)
+ #
+ # output = Image.fromarray(images)
+ # output.save('%s/fake_samples_epoch_%03d.png' % (out_dir, batch_idx))
+
+ # if texts is not None:
+ # fid = open('%s/fake_samples_epoch_%03d_%03d.txt' % (image_dir, epoch, idx), 'w')
+ # for idx in range(images.shape[0]):
+ # fid.write(str(idx) + '--------------------------------------------------------\n')
+ # for i in range(len(texts)):
+ # fid.write(texts[i][idx] + '\n')
+ # fid.write('\n\n')
+ # fid.close()
+ return
\ No newline at end of file
diff --git a/demo/Barney.png b/demo/Barney.png
new file mode 100644
index 0000000000000000000000000000000000000000..202f3175ebfcd865668847cd1aa72ffd705d4aae
Binary files /dev/null and b/demo/Barney.png differ
diff --git a/demo/Betty.png b/demo/Betty.png
new file mode 100644
index 0000000000000000000000000000000000000000..1e10a0642ba7a1edd4728aff304f50ca1f9350bb
Binary files /dev/null and b/demo/Betty.png differ
diff --git a/demo/Crong.png b/demo/Crong.png
new file mode 100644
index 0000000000000000000000000000000000000000..e1d48d2455750bbc95f9b0f1a067e0920f6ad032
Binary files /dev/null and b/demo/Crong.png differ
diff --git a/demo/Dino.png b/demo/Dino.png
new file mode 100644
index 0000000000000000000000000000000000000000..d88c798f7e00933605d785b19eaff69edf35aa77
Binary files /dev/null and b/demo/Dino.png differ
diff --git a/demo/Eddy.png b/demo/Eddy.png
new file mode 100644
index 0000000000000000000000000000000000000000..9479d4170af85ce7682b729895e7a523896edb98
Binary files /dev/null and b/demo/Eddy.png differ
diff --git a/demo/Fred.png b/demo/Fred.png
new file mode 100644
index 0000000000000000000000000000000000000000..38679f5e8661d51409ef88619547faaea2b2cf55
Binary files /dev/null and b/demo/Fred.png differ
diff --git a/demo/Harry.png b/demo/Harry.png
new file mode 100644
index 0000000000000000000000000000000000000000..df502a7ce46aebb55aa3be87248ab6a78147681f
Binary files /dev/null and b/demo/Harry.png differ
diff --git a/demo/Loopy.png b/demo/Loopy.png
new file mode 100644
index 0000000000000000000000000000000000000000..fed0c25405b4eab38dd19c4bdf2df5976a712da2
Binary files /dev/null and b/demo/Loopy.png differ
diff --git a/demo/MrSlate.png b/demo/MrSlate.png
new file mode 100644
index 0000000000000000000000000000000000000000..51cc2b4205d213d68427c5604038615b583af8a3
Binary files /dev/null and b/demo/MrSlate.png differ
diff --git a/demo/Pebbles.png b/demo/Pebbles.png
new file mode 100644
index 0000000000000000000000000000000000000000..755ebbc4bc7e9e17a21cfda23c1bf69a76be4d67
Binary files /dev/null and b/demo/Pebbles.png differ
diff --git a/demo/Petty.png b/demo/Petty.png
new file mode 100644
index 0000000000000000000000000000000000000000..2ea5323684375f1de5e4a5615d6894c2f6d04c92
Binary files /dev/null and b/demo/Petty.png differ
diff --git a/demo/Poby.png b/demo/Poby.png
new file mode 100644
index 0000000000000000000000000000000000000000..a54dc035b28bc37afc8c157adb4f9dc16ad05ddb
Binary files /dev/null and b/demo/Poby.png differ
diff --git a/demo/Pororo.png b/demo/Pororo.png
new file mode 100644
index 0000000000000000000000000000000000000000..02505f97a53b8b1ae96a8cfe455d0cb5fd99a9bf
Binary files /dev/null and b/demo/Pororo.png differ
diff --git a/demo/Rody.png b/demo/Rody.png
new file mode 100644
index 0000000000000000000000000000000000000000..cba5e3550759a1b85fc27b6b323ca360b69b9d2a
Binary files /dev/null and b/demo/Rody.png differ
diff --git a/demo/Tongtong.png b/demo/Tongtong.png
new file mode 100644
index 0000000000000000000000000000000000000000..db6676a88380ec8fbbf7aa4195c24829e5680584
Binary files /dev/null and b/demo/Tongtong.png differ
diff --git a/demo/Wilma.png b/demo/Wilma.png
new file mode 100644
index 0000000000000000000000000000000000000000..712ee8e6135087c26aa7bbdd418a854e84a6983e
Binary files /dev/null and b/demo/Wilma.png differ
diff --git a/demo/get_source_frames.py b/demo/get_source_frames.py
new file mode 100644
index 0000000000000000000000000000000000000000..90dd15a83f24d92f65a5c1f07b59fbc874f01ee7
--- /dev/null
+++ b/demo/get_source_frames.py
@@ -0,0 +1,75 @@
+from PIL import Image
+import os
+import random
+import numpy as np
+import json
+
+pororo_source_frame_paths = {
+ 'Pororo': '/playpen-ssd/adyasha/projects/StoryGAN/pororo_png/Pororo_ENGLISH1_2/Pororo_ENGLISH1_2_ep6/12.png',
+ 'Loopy': '/playpen-ssd/adyasha/projects/StoryGAN/pororo_png/Pororo_ENGLISH1_1/Pororo_ENGLISH1_1_ep12/26.png',
+ 'Crong': '/playpen-ssd/adyasha/projects/StoryGAN/pororo_png/Pororo_ENGLISH1_1/Pororo_ENGLISH1_1_ep12/10.png',
+ 'Poby': '/playpen-ssd/adyasha/projects/StoryGAN/pororo_png/Pororo_ENGLISH1_1/Pororo_ENGLISH1_1_ep9/34.png',
+ 'Eddy': '/playpen-ssd/adyasha/projects/StoryGAN/pororo_png/Pororo_ENGLISH1_1/Pororo_ENGLISH1_1_ep12/46.png',
+ 'Petty': '/playpen-ssd/adyasha/projects/StoryGAN/pororo_png/Pororo_ENGLISH2_1/Pororo_ENGLISH2_1_ep1/34.png',
+ 'Tongtong': '/playpen-ssd/adyasha/projects/StoryGAN/pororo_png/Pororo_ENGLISH3_1/Pororo_ENGLISH3_1_ep7/8.png',
+ 'Rody': '/playpen-ssd/adyasha/projects/StoryGAN/pororo_png/Pororo_ENGLISH3_1/Pororo_ENGLISH3_1_ep6/66.png',
+ 'Harry': '/playpen-ssd/adyasha/projects/StoryGAN/pororo_png/Pororo_ENGLISH3_1/Pororo_ENGLISH3_1_ep7/39.png',
+}
+
+
+flintstones_source_frame_paths = {
+ "Wilma": '',
+ "Fred": '',
+ "Betty": '',
+ "Barney": '',
+ "Dino": '',
+ "Pebbles": '',
+ "Mr Slate": ''
+}
+
+
+def sample_image(im):
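+ # Each source image is a strip of square video frames stacked along its longer side;
+ # crop one frame uniformly at random.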
+ shorter, longer = min(im.size[0], im.size[1]), max(im.size[0], im.size[1])
+ video_len = int(longer / shorter)
+ se = np.random.randint(0, video_len, 1)[0]
+ return im.crop((0, se * shorter, shorter, (se + 1) * shorter))
+
+
+def get_pororo_source_frames():
+
+ # sample_image(Image.open(os.path.join(img_folder, tgt_img_path)).convert('RGB'))
+ # labels = np.load('../../StoryGAN/pororo_png/labels.npy', allow_pickle=True, encoding='latin1').item()
+ # for i in range(9):
+ # print(i)
+ # individual_frames = [(k, v) for k, v in labels.items() if v[i] == 1 and not any([v[j] == 1 for j in range(9) if j!=i])]
+ # print(random.sample(individual_frames, k=10))
+
+ for k, v in pororo_source_frame_paths.items():
+
+ img = sample_image(Image.open(v).convert('RGB'))
+ img.save(k + '.png')
+
+
+def get_flintstones_source_frames():
+
+ dir_path = '../../StoryGAN/flintstones'
+ annotations = json.load(open('../../StoryGAN/flintstones/flintstones_annotations_v1-0.json', 'r'))
+ for k in flintstones_source_frame_paths.keys():
+
+ if k != "Barney":
+ continue
+
+ character_frames = []
+ for sample in annotations:
+ sample_characters = [c["entityLabel"].strip().lower() for c in sample["characters"]]
+ if sample_characters[0] == k.lower():
+ character_frames.append(sample["globalID"])
+
+ globalID = random.choice(character_frames)
+ arr = np.load(os.path.join(dir_path, 'video_frames_sampled', globalID + '.npy'))
+ n_frames = arr.shape[0]
+ im = arr[random.randrange(n_frames)]
+ im = Image.fromarray(im)
+ im.save(k.replace(' ', '') + '.png')
+
+
+if __name__ == '__main__':
+ get_flintstones_source_frames()
\ No newline at end of file
diff --git a/demo/parse_captions.py b/demo/parse_captions.py
new file mode 100644
index 0000000000000000000000000000000000000000..d05dbf556eb30733e75e78f6e440153e4162707c
--- /dev/null
+++ b/demo/parse_captions.py
@@ -0,0 +1,35 @@
+import os
+import csv
+import numpy as np
+
+img_folder = '/playpen-ssd/adyasha/projects/StoryGAN/pororo_png/'
+def get_captions_by_split():
+
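+ # descriptions.npy maps a frame id to its list of captions; following_cache4.npy maps
+ # each source frame to the `video_len` frames that follow it; train_seen_unseen_ids.npy
+ # holds the train/validation/test split ids.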
+ video_len = 4
+ descriptions_original = np.load(os.path.join(img_folder, 'descriptions.npy'), allow_pickle=True,
+ encoding='latin1').item()
+ followings = np.load(os.path.join(img_folder, 'following_cache4.npy'))
+
+ train_ids, val_ids, test_ids = np.load(os.path.join(img_folder, 'train_seen_unseen_ids.npy'), allow_pickle=True)
+ filenames = ['descriptions_train.csv', 'descriptions_val.csv', 'descriptions_test.csv']
+
+ for ids, filename in zip([train_ids, val_ids, test_ids], filenames):
+ im_ids = []
+ for src_img_id in ids:
+ tgt_img_paths = [str(followings[src_img_id][i])[2:-1] for i in range(video_len)]
+ tgt_img_ids = [str(tgt_img_path).replace(img_folder, '').replace('.png', '') for tgt_img_path in
+ tgt_img_paths]
+ im_ids.extend(tgt_img_ids)
+ # captions = [descriptions_original[tgt_img_id] for tgt_img_id in tgt_img_ids]
+
+ im_ids = list(set(im_ids))
+ im_ids.sort()
+
+ # captions = [descriptions_original[i] for i in im_ids]
+ with open(os.path.join(img_folder, filename), 'w') as csvfile:
+ # creating a csv writer object
+ csvwriter = csv.writer(csvfile)
+ for i in im_ids:
+ csvwriter.writerow([i, descriptions_original[i][0]])
+
+
+if __name__ == '__main__':
+ get_captions_by_split()
\ No newline at end of file
diff --git a/pororo_characters.png b/pororo_characters.png
new file mode 100644
index 0000000000000000000000000000000000000000..edf236bab521bc6e1c7d1bf9c13f11bef1e8b9a7
Binary files /dev/null and b/pororo_characters.png differ