eubinecto committed on
Commit 2bd8a1e · 2 Parent(s): 7999e15 2b6388f

[#1] merging issue_1 to main
.gitignore CHANGED
@@ -127,3 +127,8 @@ dmypy.json

 # Pyre type checker
 .pyre/
+
+ artifacts
+ wandb
+ .idea
+
README.md CHANGED
@@ -1,80 +1,13 @@
- # idiomify
- A human-inspired Idiomifier.
-
- # Proposal: Killing two birds with a human-inspired Idiomifier
-
- Date: January 22, 2022 3:42 AM
- keywords: idiomify, idioms, inductive biases, novel predictions
-
- ## What are your research questions?
-
- Given the following two [connectionist models](https://huggingface.co/bert-base-uncased) (two versions of a language model called BERT):
-
- | models | what task has it learned already? | what new task will it be taught? |
- | --- | --- | --- |
- | L1 Idiomifier | Has been pre-trained with the fill-in-the-blank task on English Wikipedia only (i.e. monolingual BERT) | Eng2Eng Idiomify task. |
- | L2 Idiomifier | Has been pre-trained with the fill-in-the-blank task on Wikipedia in multiple languages, including English (i.e. multilingual BERT) | Eng2Eng Idiomify task (the same). |
-
- where examples of the Eng2Eng Idiomify task are:
- <img width="813" alt="image" src="https://user-images.githubusercontent.com/56193069/154847480-adacff57-68fc-40c1-af73-dab478f8ab19.png">
-
-
- I have the following two research questions:
-
- 1. (SLA → NLP) If we have both of the models **decrementally infer** the figurative meaning of idioms from their constituents, will this lead to increased performance on the Eng2Eng Idiomify task?
- 2. (NLP → SLA) What differences can we observe between the L1 & L2 Idiomifiers in how they learn the Eng2Eng Idiomify task? From this, can we draw any **novel predictions** on how L1 & L2 learners might differ in learning idioms?
-
- ## But why? What is your rationale?
-
- <img width="581" alt="image" src="https://user-images.githubusercontent.com/56193069/154847506-88c4283d-8a35-4c53-81c1-83c193ecf739.png">
-
- In short, the reason I have the two questions is to **kill two birds with one stone**, where the two birds are ***suggesting better biases*** and ***suggesting novel predictions***, and the stone is ***designing a human-inspired Idiomifier***.
-
- ### What do you mean by the first bird, *suggest better biases*? (SLA → NLP)
-
- I think we could improve how machines process idioms if we draw inspiration from how humans go about learning them. That is, if we could introduce human-inspired biases to machines, we may be able to improve their performance on figurative processing.
-
- <img width="800" alt="image" src="https://user-images.githubusercontent.com/56193069/154848885-0e40af8d-7554-429e-aff3-965e6121afec.png">
-
- But first, why do we even need machines to better understand idioms? Because, although huge progress has been made in Natural Language Processing (NLP) in recent years, **figurative processing has always been a "pain in the neck" in NLP, so to speak.** Take [BERT](https://arxiv.org/abs/1810.04805) as an example. It is a connectionist language model that can be finetuned to fill in the blanks (top left), answer a question (top right), summarize a paragraph (bottom left), analyse sentiments (bottom right), etc. These are by no means easy tasks for machines, but as you can see from the examples above, the performance of BERT on these colloquial tasks is quite impressive.
-
- <img width="893" alt="image" src="https://user-images.githubusercontent.com/56193069/154848914-67a3aa0f-2171-433e-8a56-2187fff60f7c.png">
-
- However, when it comes to processing idioms, BERT is far from impressive. Without even getting into the literature, you can already see how replacing *get ready* (left) with *wet my gills* (right) substantially changes the predictions on the fill-in-the-blank task, although the two phrases essentially mean the same thing. Ideally, the probability distribution should stay more or less the same, but it doesn't. This is because, as with many other language models, BERT falls short at processing figures of speech.
-
- <img width="872" alt="image" src="https://user-images.githubusercontent.com/56193069/154848931-2b81a5fe-85b0-4868-bd20-d7326f83b9f3.png">
-
- Given that the goal of NLP is to "process all forms of natural language well" (Haagsma, 2020), NLP researchers have unanimously started to point out this problem in recent years. Just like humans, a well-designed NLP system should be able to process any form of natural language, whether it be formal (e.g. writing an email), colloquial (e.g. chatting with friends), or canonical / structured (e.g. writing essays). While some success has been achieved in processing canonical language, as we saw above, language models are "still far from revealing implicit meaning" of figures of speech (Shwartz & Dagan, 2019). Likewise, "idiomatic meaning gets overpowered by compositional meaning of the expressions" (Saxena & Paul, 2020), partly because their constituents are more often found separately in corpora than together as idioms. All in all, "figurative language is an important research area for computational & cognitive linguistics", as the ACL remarks in its report on the 2020 workshop, aptly named *Figurative Language Processing*.
-
- <img width="529" alt="image" src="https://user-images.githubusercontent.com/56193069/154848936-206d4d8a-3232-412c-91c6-62719207e1f0.png">
-
- So, there is huge room for improvement in figurative language processing, but where do we get the ideas for this improvement? We could take various approaches, but Shwartz & Dagan (2019) suggest what I think is arguably the most sensible one: "get some inspiration from the way that **humans learn idioms**". We at least have a working answer in the human brain, however elusive it may be, so it is sensible to try to replicate it in machines rather than invent a completely new solution from scratch. It works in the human brain, so it may well work in connectionist language models (layers of artificial neural networks). This is what I mean by saying SLA could *suggest better biases* to NLP. That is, we could improve the performance of such language models on processing idioms, specifically BERT for my dissertation, by drawing inspiration (i.e. biases) from how humans learn idioms.
-
- <img width="676" alt="image" src="https://user-images.githubusercontent.com/56193069/154848943-c800b0ca-5ad1-437a-9590-46b6b5d5cfb2.png">
-
- What better biases have I found, then? The Global Elaboration Hypothesis (Levorato & Cacciari, 1995; Karlson, 2019) posits that both L1 and L2 learners may start learning idioms by first deducing the figurative meaning from the literal meaning, for those idioms that have yet to take a place in their mental lexicon (vocabulary). It is not as if they get the metaphor behind the literal interpretation right off the bat. However, as the learners age and continue learning those idioms, they gradually come to treat idioms as single chunks and stop relying on analogies to get the figurative meaning. For example, when L2 learners of English encounter the idiom *throw the baby out with the bathwater* for the first time, their first reaction is to interpret the meaning literally, which they analogize with a given context to guess the figurative meaning, *to discard something valuable along with what is unwanted*. As they go along, however, they gradually stop imagining babies being thrown out together with dirty water, and at the end of their learning they don't even think of babies when using *throw the baby out with the bathwater* in its idiomatic sense - they just use it as a single chunk.
-
- If that's how we go about learning idioms, that is, if humans use the literal interpretation of idioms to "bootstrap" their understanding of the figurative meaning, so to speak, then there is nothing stopping us from expecting that such a bootstrapping bias may also be useful for teaching idioms to machines.
-
- Hence, I believe it is sensible to ask the first question:
-
- 1. (SLA → NLP) If we have both of the models **decrementally infer** the figurative meaning of idioms from their constituents, will this lead to increased performance on the Eng2Eng Idiomify task? If so, what would be the mathematical interpretation of such human-inspired success?
-
- ### What do you mean by the second bird, *suggest novel predictions*? (NLP → SLA)
- (work in progress)
-
- 1. (NLP → SLA) What differences can we observe between the L1 & L2 Idiomifiers in how they learn idioms? From this, **can we draw any novel predictions on how L1 & L2 learners learn idioms?**
-
- ## Miscellaneous
- ![KakaoTalk_Photo_2022-02-15-21-02-40](https://user-images.githubusercontent.com/56193069/154849076-0131f445-0131-49aa-bd77-6687adb94f5e.png)
-
+ # Idiomify
+
+ A human-inspired Idiomifier based on BERT
+
+ <img width="807" alt="image" src="https://user-images.githubusercontent.com/56193069/153775460-5ca04edd-e788-442d-b0f1-e780dc0a5724.png">
+
+
+
+ ## Requirements
+ - wandb
+ - pytorch-lightning
+ - transformers
+ - pandas
config.yaml ADDED
@@ -0,0 +1,8 @@
+ tag011:
+   desc: just overfitting
+   bart: facebook/bart-base
+   lr: 0.0001
+   literal2idiomatic_ver: tag01
+   max_epochs: 100
+   batch_size: 100
+   shuffle: true
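Each top-level key in config.yaml is a version tag (e.g. `tag011`) mapping to the hyperparameters of one run; `fetch_config()` in idiomify/fetchers.py (added below) loads the whole mapping, and the entry scripts index it with `--ver`. A minimal sketch of that flow, assuming config.yaml sits in the working directory:

```python
import yaml

# mirrors fetch_config() in idiomify/fetchers.py: version tag -> hyperparameters
with open("config.yaml", "r", encoding="utf-8") as fh:
    config = yaml.safe_load(fh)["tag011"]

print(config["bart"])  # facebook/bart-base
print(config["lr"])    # 0.0001
```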
explore/explore_bart.py ADDED
@@ -0,0 +1,16 @@
+ from transformers import BartTokenizer, BartModel
+
+
+ def main():
+
+     tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
+     model = BartModel.from_pretrained('facebook/bart-large')
+
+     inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
+     outputs = model(**inputs)
+     H_all = outputs.last_hidden_state  # noqa
+     print(H_all.shape)  # (1, 8, 1024)
+
+
+ if __name__ == '__main__':
+     main()
explore/explore_bart_for_conditional_generation.py ADDED
@@ -0,0 +1,10 @@
+
+ from transformers import BartTokenizer, BartForConditionalGeneration
+
+
+ def main():
+     pass
+
+
+ if __name__ == '__main__':
+     main()
explore/explore_bart_logits_shape.py ADDED
@@ -0,0 +1,39 @@
+ from transformers import BartTokenizer, BartForConditionalGeneration
+
+ from data import IdiomifyDataModule
+
+
+ CONFIG = {
+     "literal2idiomatic_ver": "pie_v0",
+     "batch_size": 20,
+     "num_workers": 4,
+     "shuffle": True
+ }
+
+
+ def main():
+     tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
+     bart = BartForConditionalGeneration.from_pretrained('facebook/bart-large')
+     datamodule = IdiomifyDataModule(CONFIG, tokenizer)
+     datamodule.prepare_data()
+     datamodule.setup()
+     for batch in datamodule.train_dataloader():
+         srcs, tgts_r, tgts = batch
+         input_ids, attention_mask = srcs[:, 0], srcs[:, 1]  # noqa
+         decoder_input_ids, decoder_attention_mask = tgts_r[:, 0], tgts_r[:, 1]
+         outputs = bart(input_ids=input_ids,
+                        attention_mask=attention_mask,
+                        decoder_input_ids=decoder_input_ids,
+                        decoder_attention_mask=decoder_attention_mask)
+         logits = outputs[0]
+         print(logits.shape)
+         """
+         torch.Size([20, 47, 50265])
+         (N, L, |V|)
+         """
+
+         break
+
+
+ if __name__ == '__main__':
+     main()
explore/explore_fetch_idioms.py ADDED
@@ -0,0 +1,9 @@
+ from idiomify.fetchers import fetch_idioms
+
+
+ def main():
+     print(fetch_idioms("pie_v0"))
+
+
+ if __name__ == '__main__':
+     main()
explore/explore_fetch_literal2idiomatic.py ADDED
@@ -0,0 +1,10 @@
+ from idiomify.fetchers import fetch_literal2idiomatic
+
+
+ def main():
+     for src, tgt in fetch_literal2idiomatic("pie_v0"):
+         print(src, "->", tgt)
+
+
+ if __name__ == '__main__':
+     main()
explore/explore_fetch_pie.py ADDED
@@ -0,0 +1,14 @@
+
+ from idiomify.fetchers import fetch_pie
+
+
+ def main():
+     for idx, row in enumerate(fetch_pie()):
+         print(idx, row)
+         # the first 106 rows (idx 0-105) = V0.
+         if idx == 105:
+             break
+
+
+ if __name__ == '__main__':
+     main()
explore/explore_fetch_seq2seq.py ADDED
@@ -0,0 +1,10 @@
+ from idiomify.fetchers import fetch_seq2seq
+
+
+ def main():
+     model = fetch_seq2seq("overfit")
+     print(model.bart.config)
+
+
+ if __name__ == '__main__':
+     main()
explore/explore_fetch_seq2seq_predict.py ADDED
@@ -0,0 +1,19 @@
+ from transformers import BartTokenizer
+ from builders import SourcesBuilder
+ from fetchers import fetch_seq2seq
+
+
+ def main():
+     model = fetch_seq2seq("overfit")
+     tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
+     lit2idi = [
+         ("my man", ""),
+         ("hello", "")
+     ]  # just some dummy stuff
+     srcs = SourcesBuilder(tokenizer)(lit2idi)
+     out = model.predict(srcs=srcs)
+     print(out)
+
+
+ if __name__ == '__main__':
+     main()
explore/explore_idiom2subwords.py ADDED
File without changes
explore/explore_idiomifydatamodule.py ADDED
@@ -0,0 +1,26 @@
+ from transformers import BartTokenizer
+ from idiomify.data import IdiomifyDataModule
+
+
+ CONFIG = {
+     "literal2idiomatic_ver": "pie_v0",
+     "batch_size": 20,
+     "num_workers": 4,
+     "shuffle": True
+ }
+
+
+ def main():
+     tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
+     datamodule = IdiomifyDataModule(CONFIG, tokenizer)
+     datamodule.prepare_data()
+     datamodule.setup()
+     for batch in datamodule.train_dataloader():
+         srcs, tgts_r, tgts = batch
+         print(srcs.shape)
+         print(tgts_r.shape)
+         print(tgts.shape)
+
+
+ if __name__ == '__main__':
+     main()
explore/explore_nlpaug.py ADDED
@@ -0,0 +1,21 @@
+
+ import nlpaug.augmenter.word as naw
+ import nlpaug.augmenter.sentence as nas
+
+ import nltk
+
+
+ sent = "I am really happy with the new job and I mean that with sincere feeling"
+
+
+ def main():
+     nltk.download("omw-1.4")
+     # this seems legit! I could definitely use this to increase the accuracy of the model
+     # for a few idioms (possibly ten, ten very different but frequent idioms)
+     aug = naw.ContextualWordEmbsAug()
+     augmented = aug.augment(sent, n=10)
+     print(augmented)
+
+
+ if __name__ == '__main__':
+     main()
explore/explore_src_builder.py ADDED
@@ -0,0 +1,18 @@
+ from transformers import BartTokenizer
+ from idiomify.builders import SourcesBuilder
+
+ BATCH = [
+     ("I could die at any moment", "I could kick the bucket at any moment"),
+     ("Speak plainly", "Don't beat around the bush")
+ ]
+
+
+ def main():
+     tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
+     builder = SourcesBuilder(tokenizer)
+     src = builder(BATCH)
+     print(src)
+
+
+ if __name__ == '__main__':
+     main()
explore/explore_tgt_builder.py ADDED
@@ -0,0 +1,19 @@
+ from transformers import BartTokenizer
+ from idiomify.builders import TargetsBuilder
+
+ BATCH = [
+     ("I could die at any moment", "I could kick the bucket at any moment"),
+     ("Speak plainly", "Don't beat around the bush")
+ ]
+
+
+ def main():
+     tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
+     builder = TargetsBuilder(tokenizer)
+     tgt_r, tgt = builder(BATCH)
+     print(tgt_r)
+     print(tgt)
+
+
+ if __name__ == '__main__':
+     main()
idiomify/__init__.py ADDED
File without changes
idiomify/builders.py ADDED
@@ -0,0 +1,87 @@
+ """
+ all the functions for building tensors are defined here.
+ builders must accept device as one of the parameters.
+ """
+ import torch
+ from typing import List, Tuple
+ from transformers import BartTokenizer
+
+
+ class TensorBuilder:
+
+     def __init__(self, tokenizer: BartTokenizer):
+         self.tokenizer = tokenizer
+
+     def __call__(self, *args, **kwargs) -> torch.Tensor:
+         raise NotImplementedError
+
+
+ class Idiom2SubwordsBuilder(TensorBuilder):
+
+     def __call__(self, idioms: List[str], k: int) -> torch.Tensor:
+         """
+         1. The function takes in a list of idioms, and a maximum length of the input sequence.
+         2. It then splits the idioms into words, and pads the sequence to the maximum length.
+         3. It masks the padding tokens, and returns the input ids.
+         :param idioms: a list of idioms, each a string of space-separated words
+         :type idioms: List[str]
+         :param k: the maximum length of the idioms
+         :type k: int
+         :return: the input_ids of the idioms, with the pad tokens replaced by the mask token.
+         """
+         mask_id = self.tokenizer.mask_token_id
+         pad_id = self.tokenizer.pad_token_id
+         # temporarily disable single-token status of the idioms
+         idioms = [idiom.split(" ") for idiom in idioms]
+         encodings = self.tokenizer(text=idioms,
+                                    add_special_tokens=False,
+                                    # should set this to True, as we already have the idioms split.
+                                    is_split_into_words=True,
+                                    padding='max_length',
+                                    max_length=k,  # set to k
+                                    return_tensors="pt")
+         input_ids = encodings['input_ids']
+         input_ids[input_ids == pad_id] = mask_id
+         return input_ids
+
+
+ class SourcesBuilder(TensorBuilder):
+     """
+     to be used for both training and inference
+     """
+     def __call__(self, literal2idiomatic: List[Tuple[str, str]]) -> torch.Tensor:
+         encodings = self.tokenizer(text=[literal for literal, _ in literal2idiomatic],
+                                    return_tensors="pt",
+                                    padding=True,
+                                    truncation=True,
+                                    add_special_tokens=True)
+         src = torch.stack([encodings['input_ids'],
+                            encodings['attention_mask']], dim=1)  # (N, 2, L)
+         return src  # (N, 2, L)
+
+
+ class TargetsRightShiftedBuilder(TensorBuilder):
+     """
+     This is to be used only for training. As for inference, we don't need this.
+     """
+     def __call__(self, literal2idiomatic: List[Tuple[str, str]]) -> torch.Tensor:
+         encodings = self.tokenizer([
+             self.tokenizer.bos_token + idiomatic  # starts with bos, but does not end with eos (right-shifted)
+             for _, idiomatic in literal2idiomatic
+         ], return_tensors="pt", add_special_tokens=False, padding=True, truncation=True)
+         tgts_r = torch.stack([encodings['input_ids'],
+                               encodings['attention_mask']], dim=1)  # (N, 2, L)
+         return tgts_r
+
+
+ class TargetsBuilder(TensorBuilder):
+
+     def __call__(self, literal2idiomatic: List[Tuple[str, str]]) -> torch.Tensor:
+         encodings = self.tokenizer([
+             idiomatic + self.tokenizer.eos_token  # no bos, but ends with eos
+             for _, idiomatic in literal2idiomatic
+         ], return_tensors="pt", add_special_tokens=False, padding=True, truncation=True)
+         tgts = encodings['input_ids']
+         return tgts  # (N, L)
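A quick sanity check on the right-shifting convention that `TargetsRightShiftedBuilder` and `TargetsBuilder` encode (a minimal sketch; the example sentence is arbitrary): the decoder input starts with `<s>` but has no trailing `</s>`, while the labels have no `<s>` but end with `</s>`, so at every position the label is the next token the decoder should emit.

```python
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
idiomatic = "I could kick the bucket at any moment"
# decoder input (tgts_r): starts with <s>, no trailing </s>
tgt_r_ids = tokenizer(tokenizer.bos_token + idiomatic, add_special_tokens=False)["input_ids"]
# labels (tgts): no leading <s>, ends with </s>
tgt_ids = tokenizer(idiomatic + tokenizer.eos_token, add_special_tokens=False)["input_ids"]
# the label at position t is exactly the decoder input at position t + 1
assert tgt_r_ids[1:] == tgt_ids[:-1]
```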
idiomify/data.py ADDED
@@ -0,0 +1,78 @@
+ import torch
+ from typing import Tuple, Optional, List
+ from torch.utils.data import Dataset, DataLoader
+ from pytorch_lightning import LightningDataModule
+ from wandb.sdk.wandb_run import Run
+
+ from idiomify.fetchers import fetch_literal2idiomatic
+ from idiomify.builders import SourcesBuilder, TargetsBuilder, TargetsRightShiftedBuilder
+ from transformers import BartTokenizer
+
+
+ class IdiomifyDataset(Dataset):
+     def __init__(self,
+                  srcs: torch.Tensor,
+                  tgts_r: torch.Tensor,
+                  tgts: torch.Tensor):
+         self.srcs = srcs  # (N, 2, L)
+         self.tgts_r = tgts_r  # (N, 2, L)
+         self.tgts = tgts  # (N, L)
+
+     def __len__(self) -> int:
+         """
+         Returns the size of the dataset.
+         """
+         assert self.srcs.shape[0] == self.tgts_r.shape[0] == self.tgts.shape[0]
+         return self.srcs.shape[0]
+
+     def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor, torch.LongTensor]:
+         """
+         Returns the features & the labels at the given index.
+         """
+         return self.srcs[idx], self.tgts_r[idx], self.tgts[idx]
+
+
+ class IdiomifyDataModule(LightningDataModule):
+
+     # boilerplate - just ignore these
+     def test_dataloader(self):
+         pass
+
+     def val_dataloader(self):
+         pass
+
+     def predict_dataloader(self):
+         pass
+
+     def __init__(self,
+                  config: dict,
+                  tokenizer: BartTokenizer,
+                  run: Run = None):
+         super().__init__()
+         self.config = config
+         self.tokenizer = tokenizer
+         self.run = run
+         # --- to be downloaded & built --- #
+         self.literal2idiomatic: Optional[List[Tuple[str, str]]] = None
+         self.dataset: Optional[IdiomifyDataset] = None
+
+     def prepare_data(self):
+         """
+         prepare: download all the data needed from wandb to local.
+         """
+         self.literal2idiomatic = fetch_literal2idiomatic(self.config['literal2idiomatic_ver'], self.run)
+
+     def setup(self, stage: Optional[str] = None):
+         # --- set up the builders & build the dataset --- #
+         srcs = SourcesBuilder(self.tokenizer)(self.literal2idiomatic)
+         tgts_r = TargetsRightShiftedBuilder(self.tokenizer)(self.literal2idiomatic)
+         tgts = TargetsBuilder(self.tokenizer)(self.literal2idiomatic)
+         self.dataset = IdiomifyDataset(srcs, tgts_r, tgts)
+
+     def train_dataloader(self) -> DataLoader:
+         return DataLoader(self.dataset, batch_size=self.config['batch_size'],
+                           shuffle=self.config['shuffle'], num_workers=self.config['num_workers'])
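Note the packing convention the datamodule commits to: `srcs` and `tgts_r` stack `input_ids` and `attention_mask` along dim 1, while `tgts` carries label ids only. A toy sketch of how a batch is meant to be unpacked downstream (random tensors standing in for real encodings):

```python
import torch

N, L = 4, 12  # toy batch size and sequence length
srcs = torch.randint(0, 100, (N, 2, L))    # [:, 0] = input_ids, [:, 1] = attention_mask
tgts_r = torch.randint(0, 100, (N, 2, L))  # right-shifted decoder inputs, same layout
tgts = torch.randint(0, 100, (N, L))       # label ids only

input_ids, attention_mask = srcs[:, 0], srcs[:, 1]  # each (N, L)
decoder_input_ids, decoder_attention_mask = tgts_r[:, 0], tgts_r[:, 1]
print(input_ids.shape, decoder_input_ids.shape, tgts.shape)
```

This is the same unpacking that `Seq2Seq.forward` in idiomify/models.py performs.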
idiomify/fetchers.py ADDED
@@ -0,0 +1,71 @@
+ import csv
+ from os import path
+ import yaml
+ import wandb
+ import requests
+ from typing import Tuple, List
+ from wandb.sdk.wandb_run import Run
+ from idiomify.paths import CONFIG_YAML, idioms_dir, literal2idiomatic, seq2seq_dir
+ from idiomify.urls import PIE_URL
+ from transformers import AutoModelForSeq2SeqLM, AutoConfig
+ from idiomify.models import Seq2Seq
+
+
+ def fetch_pie() -> list:
+     text = requests.get(PIE_URL).text
+     lines = (line for line in text.split("\n") if line)
+     reader = csv.reader(lines)
+     next(reader)  # skip the header
+     return [
+         row
+         for row in reader
+     ]
+
+
+ # --- from wandb --- #
+ def fetch_idioms(ver: str, run: Run = None) -> List[str]:
+     """
+     why do you need this? -> you need this to have access to the idiom embeddings.
+     """
+     # if a run object is given, we track the lineage of the data.
+     # if not, we get the dataset via the wandb Api.
+     if run:
+         artifact = run.use_artifact(f"idioms:{ver}", type="dataset")
+     else:
+         artifact = wandb.Api().artifact(f"eubinecto/idiomify/idioms:{ver}", type="dataset")
+     artifact_dir = artifact.download(root=idioms_dir(ver))
+     txt_path = path.join(artifact_dir, "all.txt")
+     with open(txt_path, 'r') as fh:
+         return [line.strip() for line in fh]
+
+
+ def fetch_literal2idiomatic(ver: str, run: Run = None) -> List[Tuple[str, str]]:
+     # if a run object is given, we track the lineage of the data.
+     # if not, we get the dataset via the wandb Api.
+     if run:
+         artifact = run.use_artifact(f"literal2idiomatic:{ver}", type="dataset")
+     else:
+         artifact = wandb.Api().artifact(f"eubinecto/idiomify/literal2idiomatic:{ver}", type="dataset")
+     artifact_dir = artifact.download(root=literal2idiomatic(ver))
+     tsv_path = path.join(artifact_dir, "all.tsv")
+     with open(tsv_path, 'r') as fh:
+         reader = csv.reader(fh, delimiter="\t")
+         return [(row[0], row[1]) for row in reader]
+
+
+ def fetch_seq2seq(ver: str, run: Run = None) -> Seq2Seq:
+     if run:
+         artifact = run.use_artifact(f"seq2seq:{ver}", type="model")
+     else:
+         artifact = wandb.Api().artifact(f"eubinecto/idiomify/seq2seq:{ver}", type="model")
+     config = artifact.metadata
+     artifact_dir = artifact.download(root=seq2seq_dir(ver))
+     ckpt_path = path.join(artifact_dir, "model.ckpt")
+     bart = AutoModelForSeq2SeqLM.from_config(AutoConfig.from_pretrained(config['bart']))
+     alpha = Seq2Seq.load_from_checkpoint(ckpt_path, bart=bart)
+     return alpha
+
+
+ def fetch_config() -> dict:
+     with open(str(CONFIG_YAML), 'r', encoding="utf-8") as fh:
+         return yaml.safe_load(fh)
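The run-vs-Api branching above is the one pattern every fetcher repeats: pass the active run during training so wandb records the data lineage, or fall back to the public Api in one-off scripts. A minimal sketch of both call sites (assumes wandb credentials are already configured):

```python
import wandb
from idiomify.fetchers import fetch_literal2idiomatic

# one-off script: no run object, so the artifact comes via wandb.Api()
pairs = fetch_literal2idiomatic("tag01")

# inside a training run: run.use_artifact() records the lineage
with wandb.init(entity="eubinecto", project="idiomify") as run:
    pairs = fetch_literal2idiomatic("tag01", run)
```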
idiomify/metrics.py ADDED
@@ -0,0 +1,4 @@
+ """
+ you may want to include the BLEU score,
+ and more metrics for paraphrasing.
+ """
idiomify/models.py ADDED
@@ -0,0 +1,78 @@
+ """
+ The reverse dictionary models below are based off of: https://github.com/yhcc/BertForRD/blob/master/mono/model/bert.py
+ """
+ from typing import Tuple
+ import torch
+ from torch.nn import functional as F
+ import pytorch_lightning as pl
+ from transformers import BartForConditionalGeneration, BartTokenizer
+ from idiomify.builders import SourcesBuilder
+
+
+ # for training
+ class Seq2Seq(pl.LightningModule):  # noqa
+     """
+     the baseline is in here.
+     """
+     def __init__(self, bart: BartForConditionalGeneration, lr: float, bos_token_id: int, pad_token_id: int):  # noqa
+         super().__init__()
+         self.bart = bart
+         self.save_hyperparameters(ignore=["bart"])
+
+     def forward(self, srcs: torch.Tensor, tgts_r: torch.Tensor) -> torch.Tensor:
+         """
+         as for using bart for CG, refer to:
+         https://huggingface.co/docs/transformers/model_doc/bart#transformers.BartForQuestionAnswering.forward
+         :param srcs: (N, 2, L_s)
+         :param tgts_r: (N, 2, L_t)
+         :return: (N, L_t, |V|)
+         """
+         input_ids, attention_mask = srcs[:, 0], srcs[:, 1]
+         decoder_input_ids, decoder_attention_mask = tgts_r[:, 0], tgts_r[:, 1]
+         outputs = self.bart(input_ids=input_ids,
+                             attention_mask=attention_mask,
+                             decoder_input_ids=decoder_input_ids,
+                             decoder_attention_mask=decoder_attention_mask)
+         logits = outputs[0]  # (N, L_t, |V|)
+         return logits
+
+     def training_step(self, batch: Tuple[torch.Tensor, torch.Tensor, torch.Tensor]) -> dict:
+         srcs, tgts_r, tgts = batch  # (N, 2, L_s), (N, 2, L_t), (N, L_t)
+         logits = self.forward(srcs, tgts_r)  # -> (N, L_t, |V|)
+         logits = logits.transpose(1, 2)  # (N, L_t, |V|) -> (N, |V|, L_t)
+         loss = F.cross_entropy(logits, tgts, ignore_index=self.hparams['pad_token_id'])  # -> scalar (mean over non-pad tokens)
+         return {
+             "loss": loss
+         }
+
+     def on_train_batch_end(self, outputs: dict, *args, **kwargs):
+         self.log("Train/Loss", outputs['loss'])
+
+     def configure_optimizers(self) -> torch.optim.Optimizer:
+         """
+         Instantiates and returns the optimizer to be used for this model,
+         e.g. torch.optim.AdamW
+         """
+         # The authors used Adam, so we might as well use it as well.
+         return torch.optim.AdamW(self.parameters(), lr=self.hparams['lr'])
+
+
+ # for inference
+ class Idiomifier:
+
+     def __init__(self, model: Seq2Seq, tokenizer: BartTokenizer):
+         self.model = model
+         self.builder = SourcesBuilder(tokenizer)
+         self.model.eval()
+
+     def __call__(self, src: str, max_length=100) -> str:
+         srcs = self.builder(literal2idiomatic=[(src, "")])
+         pred_ids = self.model.bart.generate(
+             inputs=srcs[:, 0],  # (N, 2, L) -> (N, L)
+             attention_mask=srcs[:, 1],  # (N, 2, L) -> (N, L)
+             decoder_start_token_id=self.model.hparams['bos_token_id'],
+             max_length=max_length,
+         ).squeeze()  # -> (N, L_t) -> (L_t)
+         tgt = self.builder.tokenizer.decode(pred_ids, skip_special_tokens=True)
+         return tgt
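The transpose in `training_step` is easy to trip over, so here is a self-contained sketch of the shape logic with toy sizes (random tensors; `0` stands in for the pad token id):

```python
import torch
from torch.nn import functional as F

N, L, V = 2, 5, 11  # toy: batch size, target length, vocab size
logits = torch.randn(N, L, V)       # what Seq2Seq.forward returns
tgts = torch.randint(1, V, (N, L))  # label ids (0 reserved for padding here)

# F.cross_entropy expects the class dimension right after the batch dimension,
# hence the (N, L, |V|) -> (N, |V|, L) transpose before computing the loss.
loss = F.cross_entropy(logits.transpose(1, 2), tgts, ignore_index=0)
print(loss.item())  # scalar: mean over all non-pad positions
```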
idiomify/paths.py ADDED
@@ -0,0 +1,17 @@
+ from pathlib import Path
+
+ ROOT_DIR = Path(__file__).resolve().parent.parent
+ ARTIFACTS_DIR = ROOT_DIR / "artifacts"
+ CONFIG_YAML = ROOT_DIR / "config.yaml"
+
+
+ def idioms_dir(ver: str) -> Path:
+     return ARTIFACTS_DIR / f"idioms-{ver}"
+
+
+ def literal2idiomatic(ver: str) -> Path:
+     return ARTIFACTS_DIR / f"literal2idiomatic-{ver}"
+
+
+ def seq2seq_dir(ver: str) -> Path:
+     return ARTIFACTS_DIR / f"seq2seq-{ver}"
idiomify/urls.py ADDED
@@ -0,0 +1,13 @@
+
+ # EPIE dataset
+ EPIE_IMMUTABLE_IDIOMS_TAGS_URL = "https://raw.githubusercontent.com/prateeksaxena2809/EPIE_Corpus/master/Static_Idioms_Corpus/Static_Idioms_Tags.txt"  # noqa
+ EPIE_IMMUTABLE_IDIOMS_URL = "https://raw.githubusercontent.com/prateeksaxena2809/EPIE_Corpus/master/Static_Idioms_Corpus/Static_Idioms_Candidates.txt"  # noqa
+ EPIE_IMMUTABLE_IDIOMS_CONTEXTS_URL = "https://raw.githubusercontent.com/prateeksaxena2809/EPIE_Corpus/master/Static_Idioms_Corpus/Static_Idioms_Words.txt"  # noqa
+ EPIE_MUTABLE_IDIOMS_TAGS_URL = "https://raw.githubusercontent.com/prateeksaxena2809/EPIE_Corpus/master/Formal_Idioms_Corpus/Formal_Idioms_Tags.txt"  # noqa
+ EPIE_MUTABLE_IDIOMS_URL = "https://raw.githubusercontent.com/prateeksaxena2809/EPIE_Corpus/master/Formal_Idioms_Corpus/Formal_Idioms_Candidates.txt"  # noqa
+ EPIE_MUTABLE_IDIOMS_CONTEXTS_URL = "https://github.com/prateeksaxena2809/EPIE_Corpus/blob/master/Formal_Idioms_Corpus/Formal_Idioms_Words.txt"  # noqa
+
+ # PIE dataset (Zhou, 2021)
+ # https://aclanthology.org/2021.mwe-1.5/
+ # right, let's just work on it.
+ PIE_URL = "https://raw.githubusercontent.com/zhjjn/MWE_PIE/main/data_cleaned.csv"
main_infer.py ADDED
@@ -0,0 +1,25 @@
+ import argparse
+ from idiomify.models import Idiomifier
+ from idiomify.fetchers import fetch_config, fetch_seq2seq
+ from transformers import BartTokenizer
+
+
+ def main():
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--ver", type=str, default="tag011")
+     parser.add_argument("--src", type=str,
+                         default="If there's any good to losing my job,"
+                                 " it's that I'll now be able to go to school full-time and finish my degree earlier.")
+     args = parser.parse_args()
+     config = fetch_config()[args.ver]
+     config.update(vars(args))
+     model = fetch_seq2seq(config['ver'])
+     tokenizer = BartTokenizer.from_pretrained(config['bart'])
+     idiomifier = Idiomifier(model, tokenizer)
+     src = config['src']
+     tgt = idiomifier(src=src)
+     print(src, "\n->", tgt)
+
+
+ if __name__ == '__main__':
+     main()
main_train.py ADDED
@@ -0,0 +1,56 @@
+ import os
+ import torch.cuda
+ import wandb
+ import argparse
+ import pytorch_lightning as pl
+ from termcolor import colored
+ from pytorch_lightning.loggers import WandbLogger
+ from transformers import BartTokenizer, BartForConditionalGeneration
+ from idiomify.data import IdiomifyDataModule
+ from idiomify.fetchers import fetch_config
+ from idiomify.models import Seq2Seq
+ from idiomify.paths import ROOT_DIR
+
+
+ def main():
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--ver", type=str, default="tag011")
+     parser.add_argument("--num_workers", type=int, default=os.cpu_count())
+     parser.add_argument("--log_every_n_steps", type=int, default=1)
+     parser.add_argument("--fast_dev_run", action="store_true", default=False)
+     parser.add_argument("--upload", dest='upload', action='store_true', default=False)
+     args = parser.parse_args()
+     config = fetch_config()[args.ver]
+     config.update(vars(args))
+     if not config['upload']:
+         print(colored("WARNING: YOU CHOSE NOT TO UPLOAD. NOTHING BUT LOGS WILL BE SAVED TO WANDB", color="red"))
+
+     # prepare the model
+     bart = BartForConditionalGeneration.from_pretrained(config['bart'])
+     tokenizer = BartTokenizer.from_pretrained(config['bart'])
+     model = Seq2Seq(bart, config['lr'], tokenizer.bos_token_id, tokenizer.pad_token_id)
+     # prepare the datamodule
+     with wandb.init(entity="eubinecto", project="idiomify", config=config) as run:
+         datamodule = IdiomifyDataModule(config, tokenizer, run)
+         logger = WandbLogger(log_model=False)
+         trainer = pl.Trainer(max_epochs=config['max_epochs'],
+                              fast_dev_run=config['fast_dev_run'],
+                              log_every_n_steps=config['log_every_n_steps'],
+                              gpus=torch.cuda.device_count(),
+                              default_root_dir=str(ROOT_DIR),
+                              enable_checkpointing=False,
+                              logger=logger)
+         # start training
+         trainer.fit(model=model, datamodule=datamodule)
+         # upload the model to wandb only if the training is properly done
+         if config['upload'] and not config['fast_dev_run'] and trainer.current_epoch == config['max_epochs'] - 1:
+             ckpt_path = ROOT_DIR / "model.ckpt"
+             trainer.save_checkpoint(str(ckpt_path))
+             artifact = wandb.Artifact(name="seq2seq", type="model", metadata=config)
+             artifact.add_file(str(ckpt_path))
+             run.log_artifact(artifact, aliases=["latest", config['ver']])
+             os.remove(str(ckpt_path))  # make sure to remove it after uploading
+
+
+ if __name__ == '__main__':
+     main()
main_upload_idioms.py ADDED
@@ -0,0 +1,37 @@
+ """
+ Uploads all the idioms to wandb as a dataset artifact.
+ """
+ import os
+ from idiomify.paths import ROOT_DIR
+ from idiomify.fetchers import fetch_pie
+ import argparse
+ import wandb
+
+
+ def main():
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--ver", type=str, default="tag01")
+     config = vars(parser.parse_args())
+
+     # get the idioms here
+     if config['ver'] == "tag01":
+         # only the first 106 rows, and this is for piloting
+         idioms = {row[0] for row in fetch_pie()[:106]}
+     else:
+         raise NotImplementedError
+     idioms = list(idioms)
+
+     with wandb.init(entity="eubinecto", project="idiomify", config=config) as run:
+         artifact = wandb.Artifact(name="idioms", type="dataset")
+         txt_path = ROOT_DIR / "all.txt"
+         with open(txt_path, 'w') as fh:
+             for idiom in idioms:
+                 fh.write(idiom + "\n")
+         artifact.add_file(str(txt_path))
+         run.log_artifact(artifact, aliases=["latest", config['ver']])
+         os.remove(txt_path)
+
+
+ if __name__ == '__main__':
+     main()
main_upload_literal2idiomatic.py ADDED
@@ -0,0 +1,40 @@
+ """
+ Uploads the literal-to-idiomatic sentence pairs to wandb as a dataset artifact.
+ """
+ import csv
+ import os
+ from idiomify.paths import ROOT_DIR
+ from idiomify.fetchers import fetch_pie
+ import argparse
+ import wandb
+
+
+ def main():
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--ver", type=str, default="tag01")
+     config = vars(parser.parse_args())
+
+     # get the sentence pairs here
+     if config['ver'] == "tag01":
+         # only the first 106 rows, and we use this just for piloting
+         literal2idiom = [
+             (row[3], row[2]) for row in fetch_pie()[:106]
+         ]
+     else:
+         raise NotImplementedError
+
+     with wandb.init(entity="eubinecto", project="idiomify", config=config) as run:
+         artifact = wandb.Artifact(name="literal2idiomatic", type="dataset")
+         tsv_path = ROOT_DIR / "all.tsv"
+         with open(tsv_path, 'w') as fh:
+             writer = csv.writer(fh, delimiter="\t")
+             for row in literal2idiom:
+                 writer.writerow(row)
+         artifact.add_file(str(tsv_path))
+         run.log_artifact(artifact, aliases=["latest", config['ver']])
+         os.remove(tsv_path)
+
+
+ if __name__ == '__main__':
+     main()
requirements.txt ADDED
@@ -0,0 +1,3 @@
+ pytorch-lightning==1.5.10
+ transformers==4.16.2
+ wandb==0.12.10