---
license: mit
language:
- en
pipeline_tag: text-generation
tags:
- legal
- news
library_name: transformers
---
# GPT-Neo-1.3B SimCTG for Conditional News Generation
A [SimCTG](https://github.com/yxuansu/SimCTG) model (introduced by Su et al. in this [paper](https://arxiv.org/abs/2202.06417)) built on [GPT-Neo-1.3B](https://huggingface.co/EleutherAI/gpt-neo-1.3B) and trained to generate news articles conditioned on a publisher.

## Data Details
The model was trained on a large news corpus with content from 19 publishers. The per-publisher breakdown is as follows:

|    Publisher     | Number of Samples |
| :--------------: | :---------: |
|     Guardian     |   250,000   |
|       BBC        |   240,872   |
| WashingtonPost  |   167,401   |
|    USAToday     |   234,648   |
|     Reuters      |   822,110   |
|  NYT (New York Times)  |   245,150   |
|       CNBC       |   231,060   |
|       Hill       |   205,410   |
|      People      |   132,630   |
|       CNN        |   121,760   |
|       Vice       |   97,750    |
|     Mashable     |   91,100    |
|     Refinery     |   84,100    |
| BI (Business Insider) |   53,014    |
|    TechCrunch    |   49,040    |
|      Verge       |   48,327    |
|       TMZ        |   46,490    |
|      Axios       |   44,280    |
|       Vox        |   44,120    |
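
The corpus is far from balanced across publishers, which may matter when comparing generation quality between them. As a rough sketch, the counts in the table can be summed and turned into shares; the dictionary simply restates the table, and the variable names are illustrative.

```python
# Per-publisher counts, copied from the table above.
counts = {
    "Guardian": 250_000, "BBC": 240_872, "WashingtonPost": 167_401,
    "USAToday": 234_648, "Reuters": 822_110, "NYT": 245_150,
    "CNBC": 231_060, "Hill": 205_410, "People": 132_630,
    "CNN": 121_760, "Vice": 97_750, "Mashable": 91_100,
    "Refinery": 84_100, "BI": 53_014, "TechCrunch": 49_040,
    "Verge": 48_327, "TMZ": 46_490, "Axios": 44_280, "Vox": 44_120,
}

total = sum(counts.values())
print(f"total: {total:,}")

# Share of the corpus per publisher, largest first.
for name, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<15} {n:>9,} {n / total:6.1%}")
```

The total comes to about 3.2 million entries, with Reuters alone contributing roughly a quarter of them.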

## Training Details
We use the prompt template `Publisher: {publisher} article: ` for training, where `{publisher}` is the lowercased publisher name (for example, `vox`). We trained the model for about 3 epochs on 3 NVIDIA A40 GPUs.
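
The template can be filled in with a small helper. This is a minimal sketch: `PUBLISHERS` and `build_prompt` are hypothetical names, the accepted keys are taken from the usage example in this card, and the lowercasing follows that same example.

```python
# Publisher keys as accepted by the usage example in this card.
PUBLISHERS = [
    "Reuters", "NYT", "CNBC", "Hill", "People", "CNN", "Vice", "Mashable",
    "Refinery", "BI", "TechCrunch", "Verge", "TMZ", "Axios", "Vox",
    "Guardian", "BBCNews", "WashingtonPost", "USAToday",
]


def build_prompt(publisher: str) -> str:
    """Return the conditioning prompt for one publisher (hypothetical helper)."""
    if publisher not in PUBLISHERS:
        raise ValueError(f"Unknown publisher: {publisher!r}")
    # The publisher name is lowercased; the trailing space is part of the template.
    return f"Publisher: {publisher.lower()} article: "


print(repr(build_prompt("Vox")))
```

Validating the key up front avoids silently prompting the model with a publisher string it never saw during training.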

## How to use
```python
>>> from transformers import GPTNeoForCausalLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("PahaII/gpt-neo-1.3b-simctg-NewsCtrlGen")
>>> model = GPTNeoForCausalLM.from_pretrained("PahaII/gpt-neo-1.3b-simctg-NewsCtrlGen")

>>> publisher = "Reuters"
>>> assert publisher in ["Reuters", "NYT", "CNBC", "Hill", "People", "CNN", "Vice", "Mashable", "Refinery", "BI", "TechCrunch", "Verge", "TMZ", "Axios", "Vox", "Guardian", "BBCNews", "WashingtonPost", "USAToday"]
>>> prompt = f"Publisher: {publisher.lower()} article: "

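>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> # Contrastive search is selected by penalty_alpha > 0 together with top_k > 1.
>>> # top_k=4 and max_new_tokens=128 are illustrative values, not tuned settings.
>>> out = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=128)
>>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])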
```