---
license: mit
language:
- de
tags:
- gptj
- title generation
- headline-generation
- teaser generation
- news
inference: false
---

# GPT-J-Title-Teaser-10k

<!-- Provide a quick summary of what the model is/does. -->

gptj-title-teaser-10k  
Version 1.0 / 22 December 2022

A fine-tuned version of the [GPT-J-6B-8bit](https://huggingface.co/hivemind/gpt-j-6B-8bit) model for generating titles and teasers for news.

# Model Details

## Model Description

<!-- Provide a longer summary of what this model is. -->

Test generation capabilities here: https://snipaid.tech  

A GPT-J model finetuned on German-language news using a causal language modeling (CLM) objective.

GPT-J is a transformer model pretrained in a self-supervised fashion on [The Pile](https://huggingface.co/datasets/the_pile), a very large corpus of English text.
This means it was pretrained on raw text only, with no human labelling of any kind (which is why it can use lots of publicly available data),
using an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences.

Inputs are sequences of continuous text of a certain length, and the targets are the same sequences shifted one token (word or piece of a word) to the right.
The model internally uses a masking mechanism to make sure the prediction for token *i* only uses the inputs from tokens 1 to *i*, never future tokens.
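
As an illustration, here is a minimal sketch (not part of the training code) of how a causal language model derives its targets from a raw token sequence; the example tokens are invented, and in 🤗 Transformers this shift is applied internally when `labels` are set to the input ids:

```python
# Illustrative only: next-token targets for a causal LM are the inputs shifted by one.
tokens = ["Der", "Zug", "fährt", "nach", "Berlin", "."]

inputs = tokens[:-1]   # context the model sees
targets = tokens[1:]   # token it must predict at each position

# The causal mask ensures position i only attends to positions 1..i,
# so every prediction is conditioned on the preceding context alone.
for i, (_, nxt) in enumerate(zip(inputs, targets), start=1):
    print(f"given tokens 1..{i} -> predict {nxt!r}")
```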

The pretrained model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks.
The model is, however, best at what it was pretrained for: generating text from a prompt.
A prompt is a piece of text inserted into the input so that the original task can be formulated as a (causal) language modeling problem.

To adapt the model to the domain of German news for the downstream task of title and teaser generation, it was finetuned on a dataset of 10,000 German news articles in a multi-task finetuning fashion.
Hence the finetuned model's name derives from the model it was finetuned from (gptj), the downstream generation tasks (title, teaser) and the size of the finetuning dataset (10k).

- **Developed by:** snipaid
- **Model type:** gptj
- **Language(s) (NLP):** de
- **License:** MIT
- **Finetuned from model:** [GPT-J-6B-8bit](https://huggingface.co/hivemind/gpt-j-6B-8bit)

# Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

The model is intended for generating titles and teasers of news documents.
  
News document: A news story's fulltext in plain text.  
Title: A few words that reflect the essence of the news story, also known as headline.  
Teaser: A few sentences that spark curiosity about the "best of the rest" of the news story.

## Direct Use and how to get started with the model

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

The model is built on [GPT-J-6B-8bit](https://huggingface.co/hivemind/gpt-j-6B-8bit), which makes it usable and fine-tunable on a single GPU with ~11 GB of memory.
Running it requires some utility code for the 8-bit quantization and the LoRA adapters.

Here's how to get started: [![colab](https://camo.githubusercontent.com/84f0493939e0c4de4e6dbe113251b4bfb5353e57134ffd9fcab6b8714514d4d1/68747470733a2f2f636f6c61622e72657365617263682e676f6f676c652e636f6d2f6173736574732f636f6c61622d62616467652e737667)](https://colab.research.google.com/drive/1-FdkAL5RYaNRkaY3cFRc_TY5yv3Scxdo)
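
Once that utility code is in place, generation follows the prompt format used during finetuning. The snippet below is an illustrative sketch rather than a copy of the notebook: it assumes the notebook's 8-bit patching of `GPTJForCausalLM` has already been applied, the model id `snipaid/gptj-title-teaser-10k` is assumed to refer to this repository, and the generation parameters are example values only.

```python
from transformers import AutoTokenizer, GPTJForCausalLM

# Assumes the 8-bit/LoRA utility code from the Colab notebook has been run first,
# so that GPTJForCausalLM can load the quantized checkpoint.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = GPTJForCausalLM.from_pretrained("snipaid/gptj-title-teaser-10k").cuda()  # model id assumed

fulltext = "..."  # plain text of the news article

# Prompt format from the training procedure; use "[Teaser]:" instead of "[Title]:" for a teaser.
prompt = f"[Text]: {fulltext} \n [Title]:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=40,  # example values, tune to taste
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```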

## Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

Misuse: 
* Generating and spreading misinformation
* Generating content that is discriminating, violent or otherwise harmful

Use cases the model will not work well for:
* Generating snippets other than title and teaser

# Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->
The base model, GPT-J, was trained on the Pile, a dataset scraped from many different websites.
This dataset is known to contain profanity, lewd, and otherwise abrasive language, alongside certain biases.
Fine-tuning does not eliminate those risks and biases. Depending upon the input, gptj-title-teaser-10k may produce socially unacceptable output.
To learn more about biases in the Pile, see [Sections 5 and 6 of the Pile paper](https://arxiv.org/abs/2101.00027).


## Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
When generating text with the model, please keep in mind that the statistically most likely next token or word often does not produce the most "accurate" text.
Never depend upon such models to produce factually accurate output!
We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to ensure the quality of the generated output.
For further information see [limitations and biases of GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B#limitations-and-biases).

# Training Details

## Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The model was finetuned on a collection of 10,000 German-language news items scraped from different online news outlets*.

\* *Namely: Speedweek, n-tv, Welt, Tagesspiegel, Faz, Merkur, Bild, Focus, Rp-Online, Freie Presse, Weser-Kurier, Tz, Stern, Kicker, Taz, Schwäbische Zeitung, Frankfurter Rundschau, Stuttgarter Zeitung, Abendzeitung, Donaukurier, Hessische/Niedersächsische Allgemeine, Kreiszeitung, Heise Online, Augsburger Allgemeine, SPOX, Nordbayern, Offenbach Post Online, inFranken, Westfälischer Anzeiger, Tagesschau, Nordkurier, Wallstreet online, Computer Bild, Die Rheinlandpfalz, Morgenweb, Bunte, Sport1, LR-Online, Gala, Wirtschaftswoche, Chip, Brigitte, NWZ Online.*

For each news item the dataset contains title, teaser and fulltext.

```
[
 {
    "title": ...,
    "teaser": ...,
    "fulltext": ...
  },
]
```

The dataset contains news items within the categories of sports, politics, panorama, culture, technology, health, knowledge, cars, travel, economy and other in equal proportions.
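
To give a sense of the expected shape, a minimal sketch for loading and sanity-checking such a dataset might look as follows (the filename is hypothetical; the dataset itself is not published alongside the model):

```python
import json

# Hypothetical filename; adjust to wherever the scraped dataset is stored.
with open("news-title-teaser-10k.json", encoding="utf-8") as f:
    items = json.load(f)

print(len(items))        # expected: 10000 news items
print(sorted(items[0]))  # expected keys: ['fulltext', 'teaser', 'title']
```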

## Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

The model was finetuned using a causal language modeling (CLM) objective for multitask finetuning.

### Preprocessing

For each news item, two training inputs were constructed by concatenating the fulltext with the title and with the teaser, as shown below.

```
f"[Text]: {item.fulltext} \n [Title]: {item.title}"
f"[Text]: {item.fulltext} \n [Teaser]: {item.teaser}"
```
This results in one input per task for each news item.

*Note: The inserted prompt "[Text]:" marks the beginning of the news item's fulltext.  
In the same manner "[Title]:" prompts the news item's title and "[Teaser]:" the news item's teaser.*
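
A minimal sketch of this preprocessing step, assuming the dataset has been loaded into a list of dicts (`items`) as in the loading example above; variable and function names are illustrative:

```python
def build_training_inputs(items):
    """Build one '[Title]' and one '[Teaser]' example per news item."""
    examples = []
    for item in items:
        examples.append(f"[Text]: {item['fulltext']} \n [Title]: {item['title']}")
        examples.append(f"[Text]: {item['fulltext']} \n [Teaser]: {item['teaser']}")
    return examples

examples = build_training_inputs(items)
print(len(examples))  # two examples per news item, one per task
```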

# Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** A100 SXM4
- **Hours used:** 27h 42min
- **Cloud Provider:** Vast.ai
- **Compute Region:** Unknown
- **Carbon Emitted:** ~4.79 kg CO2eq

# Glossary

**News Document**, plain text form of a news article or news item.  
**News Item**, aka news article. A particular piece of news, usually from a journalistic source.  
**Snippet**, a small section of text that is related to a news document.  
**Title** aka headline. A few words that reflect the essence of the news story.  
**Teaser** aka lede. A few sentences that spark curiosity about the "best of the rest" of the news story.