---
license: mit
language:
- en
pipeline_tag: text-generation
tags:
- legal
- news
library_name: transformers
---

# GPT-Neo-1.3B SimCTG for Conditional News Generation

A [SimCTG](https://github.com/yxuansu/SimCTG) model (introduced by Su et al. in this [paper](https://arxiv.org/abs/2202.06417)) built on [GPT-Neo-1.3B](https://huggingface.co/EleutherAI/gpt-neo-1.3B), a large language model. It generates news text conditioned on a publisher name.

## Data Details

The model was trained on a large news corpus containing news content from 19 different publishers. The detailed dataset configuration is as follows:

| Publisher | Number of Samples |
| :--------------: | :---------: |
| Guardian | 250,000 |
| BBC | 240,872 |
| WashingtonPost | 167,401 |
| USAToday | 234,648 |
| Reuters | 822,110 |
| NYT (New York Times) | 245,150 |
| CNBC | 231,060 |
| Hill | 205,410 |
| People | 132,630 |
| CNN | 121,760 |
| Vice | 97,750 |
| Mashable | 91,100 |
| Refinery | 84,100 |
| BI (Business Insider) | 53,014 |
| TechCrunch | 49,040 |
| Verge | 48,327 |
| TMZ | 46,490 |
| Axios | 44,280 |
| Vox | 44,120 |

## Training Details

We use the prompt template `Publisher: {publisher} article: ` for training, where `{publisher}` is the publisher name, for example `Publisher: vox article: `. We trained the model for about 3 epochs on 3 NVIDIA A40 GPUs.

## How to use

```python
>>> from transformers import GPTNeoForCausalLM, AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("PahaII/gpt-neo-1.3b-simctg-NewsCtrlGen")
>>> model = GPTNeoForCausalLM.from_pretrained("PahaII/gpt-neo-1.3b-simctg-NewsCtrlGen")

>>> # Pick one of the publisher names the model was trained on.
>>> publisher = "Reuters"
>>> assert publisher in ["Reuters", "NYT", "CNBC", "Hill", "People", "CNN", "Vice", "Mashable", "Refinery", "BI", "TechCrunch", "Verge", "TMZ", "Axios", "Vox", "Guardian", "BBCNews", "WashingtonPost", "USAToday"]

>>> # Build the prompt with the lowercased publisher name.
>>> prompt = f"Publisher: {publisher.lower()} article: "
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> # penalty_alpha is the degeneration penalty used by contrastive search.
>>> out = model.generate(**inputs, penalty_alpha=0.6)
>>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```
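
The prompt construction in the snippet above can be wrapped in a small helper so that an unsupported publisher name fails early with a clear error. This is only a minimal sketch: `PUBLISHERS` and `build_prompt` are hypothetical names chosen for illustration and are not part of the released model or of `transformers`. The list of names and the lowercasing are copied from the usage example.

```python
# Publisher names accepted by the usage example (copied from its assert).
PUBLISHERS = [
    "Reuters", "NYT", "CNBC", "Hill", "People", "CNN", "Vice", "Mashable",
    "Refinery", "BI", "TechCrunch", "Verge", "TMZ", "Axios", "Vox",
    "Guardian", "BBCNews", "WashingtonPost", "USAToday",
]


def build_prompt(publisher: str) -> str:
    """Return the conditioning prompt for one publisher (hypothetical helper)."""
    if publisher not in PUBLISHERS:
        raise ValueError(f"Unknown publisher: {publisher!r}")
    # The usage example lowercases the name before filling the template.
    return f"Publisher: {publisher.lower()} article: "


# repr() makes the trailing space of the template visible.
print(repr(build_prompt("Vox")))  # → 'Publisher: vox article: '
```

The returned string can be passed to the tokenizer in place of the hand-built `prompt` in the usage example.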