---
license: mit
language:
- de
tags:
- gptj
- title generation
- headline-generation
- teaser generation
- news
inference: false
---

# GPT-J-Title-Teaser-10k

<!-- Provide a quick summary of what the model is/does. -->

gptj-title-teaser-10k  
Version 1.0 / 22 December 2022

A fine-tuned version of the [GPT-J-6B-8bit](https://huggingface.co/hivemind/gpt-j-6B-8bit) model for generating titles and teasers for news.

# Model Details

## Model Description

<!-- Provide a longer summary of what this model is. -->

Test generation capabilities here: https://snipaid.tech  

A GPT-J model finetuned on German-language news using a causal language modeling (CLM) objective.

GPT-J is a transformer model pretrained in a self-supervised fashion on [The Pile](https://huggingface.co/datasets/the_pile), a very large corpus of English text.
This means it was pretrained on raw text only, with no human labelling of any kind (which is why it can use lots of publicly available data),
using an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences.

Inputs are sequences of continuous text of a certain length, and the targets are the same sequences shifted one token (word or piece of a word) to the right.
The model internally uses a masking mechanism to make sure the prediction for token *i* only uses the inputs from tokens 1 to *i*, never future tokens.
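
As an illustration, here is a minimal sketch (not part of the training code) of how a causal language model derives its targets from a raw token sequence; the example tokens are invented, and in 🤗 Transformers this shift is applied internally when `labels` are set to the input ids:

```python
# Illustrative only: next-token targets for a causal LM are the inputs shifted by one.
tokens = ["Der", "Zug", "fährt", "nach", "Berlin", "."]

inputs = tokens[:-1]   # context the model sees
targets = tokens[1:]   # token it must predict at each position

# The causal mask ensures position i only attends to positions 1..i,
# so every prediction is conditioned on the preceding context alone.
for i, (_, nxt) in enumerate(zip(inputs, targets), start=1):
    print(f"given tokens 1..{i} -> predict {nxt!r}")
```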

The pretrained model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks.
The model is, however, best at what it was pretrained for: generating text from a prompt.
A prompt is a piece of text inserted into the input so that the original task can be formulated as a (causal) language modeling problem.

To adapt the model to the domain of German news for the downstream task of title and teaser generation, it was finetuned on a dataset of 10,000 German news articles in a multi-task finetuning fashion.
Hence the finetuned model's name derives from the model it was finetuned from (gptj), the downstream generation tasks (title, teaser) and the size of the finetuning dataset (10k).

- **Developed by:** snipaid
- **Model type:** gptj
- **Language(s) (NLP):** de
- **License:** MIT
- **Finetuned from model:** [GPT-J-6B-8bit](https://huggingface.co/hivemind/gpt-j-6B-8bit)

# Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

The model is intended for generating titles and teasers of news documents.
  
News document: A news story's fulltext in plain text.  
Title: A few words that reflect the essence of the news story, also known as headline.  
Teaser: A few sentences that spark curiosity about the "best of the rest" of the news story.

## Direct Use and how to get started with the model

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

The model is built on [GPT-J-6B-8bit](https://huggingface.co/hivemind/gpt-j-6B-8bit), which makes it usable and fine-tunable on a single GPU with ~11 GB of memory.
Running it requires some utility code for the 8-bit quantization and the LoRA adapters.

Here's how to get started: [![colab](https://camo.githubusercontent.com/84f0493939e0c4de4e6dbe113251b4bfb5353e57134ffd9fcab6b8714514d4d1/68747470733a2f2f636f6c61622e72657365617263682e676f6f676c652e636f6d2f6173736574732f636f6c61622d62616467652e737667)](https://colab.research.google.com/drive/1-FdkAL5RYaNRkaY3cFRc_TY5yv3Scxdo)
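
Once that utility code is in place, generation follows the prompt format used during finetuning. The snippet below is an illustrative sketch rather than a copy of the notebook: it assumes the notebook's 8-bit patching of `GPTJForCausalLM` has already been applied, the model id `snipaid/gptj-title-teaser-10k` is assumed to refer to this repository, and the generation parameters are example values only.

```python
from transformers import AutoTokenizer, GPTJForCausalLM

# Assumes the 8-bit/LoRA utility code from the Colab notebook has been run first,
# so that GPTJForCausalLM can load the quantized checkpoint.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = GPTJForCausalLM.from_pretrained("snipaid/gptj-title-teaser-10k").cuda()  # model id assumed

fulltext = "..."  # plain text of the news article

# Prompt format from the training procedure; use "[Teaser]:" instead of "[Title]:" for a teaser.
prompt = f"[Text]: {fulltext} \n [Title]:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=40,  # example values, tune to taste
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```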

## Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

Misuse: 
* Generating and spreading misinformation
* Generating content that is discriminating, violent or otherwise harmful

Use cases the model will not work well for:
* Generating snippets other than title and teaser

# Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->
The base model, GPT-J, was trained on the Pile, a dataset scraped from many different websites.
This dataset is known to contain profanity, lewd, and otherwise abrasive language, alongside certain biases.
Fine-tuning does not eliminate those risks and biases. Depending upon the input, gptj-title-teaser-10k may produce socially unacceptable output.
To learn more about biases in the Pile, see [Sections 5 and 6 of the Pile paper](https://arxiv.org/abs/2101.00027).


## Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
When generating text with the model, please keep in mind that the statistically most likely next token or word often does not produce the most "accurate" text.
Never depend upon such models to produce factually accurate output!
We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to ensure the quality of the generated output.
For further information see [limitations and biases of GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B#limitations-and-biases).

# Training Details

## Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The model was finetuned on a collection of 10,000 German-language news items scraped from different online news outlets*.

\* *Namely: Speedweek, n-tv, Welt, Tagesspiegel, Faz, Merkur, Bild, Focus, Rp-Online, Freie Presse, Weser-Kurier, Tz, Stern, Kicker, Taz, Schwäbische Zeitung, Frankfurter Rundschau, Stuttgarter Zeitung, Abendzeitung, Donaukurier, Hessische/Niedersächsische Allgemeine, Kreiszeitung, Heise Online, Augsburger Allgemeine, SPOX, Nordbayern, Offenbach Post Online, inFranken, Westfälischer Anzeiger, Tagesschau, Nordkurier, Wallstreet online, Computer Bild, Die Rheinlandpfalz, Morgenweb, Bunte, Sport1, LR-Online, Gala, Wirtschaftswoche, Chip, Brigitte, NWZ Online.*

For each news item the dataset contains title, teaser and fulltext.

```
[
 {
    "title": ...,
    "teaser": ...,
    "fulltext": ...
  },
]
```

The dataset contains news items within the categories of sports, politics, panorama, culture, technology, health, knowledge, cars, travel, economy and other in equal proportions.
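
To give a sense of the expected shape, a minimal sketch for loading and sanity-checking such a dataset might look as follows (the filename is hypothetical; the dataset itself is not published alongside the model):

```python
import json

# Hypothetical filename; adjust to wherever the scraped dataset is stored.
with open("news-title-teaser-10k.json", encoding="utf-8") as f:
    items = json.load(f)

print(len(items))        # expected: 10000 news items
print(sorted(items[0]))  # expected keys: ['fulltext', 'teaser', 'title']
```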

## Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

The model was finetuned using a causal language modeling (CLM) objective for multitask finetuning.

### Preprocessing

For each news item, two training inputs were constructed by concatenating the fulltext with the title and with the teaser, as shown below.

```
f"[Text]: {item.fulltext} \n [Title]: {item.title}"
f"[Text]: {item.fulltext} \n [Teaser]: {item.teaser}"
```
This results in one input per task for each news item.

*Note: The inserted prompt "[Text]:" marks the beginning of the news item's fulltext.  
In the same manner "[Title]:" prompts the news item's title and "[Teaser]:" the news item's teaser.*
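
A minimal sketch of this preprocessing step, assuming the dataset has been loaded into a list of dicts (`items`) as in the loading example above; variable and function names are illustrative:

```python
def build_training_inputs(items):
    """Build one '[Title]' and one '[Teaser]' example per news item."""
    examples = []
    for item in items:
        examples.append(f"[Text]: {item['fulltext']} \n [Title]: {item['title']}")
        examples.append(f"[Text]: {item['fulltext']} \n [Teaser]: {item['teaser']}")
    return examples

examples = build_training_inputs(items)
print(len(examples))  # two examples per news item, one per task
```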

# Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** A100 SXM4
- **Hours used:** 27h 42min
- **Cloud Provider:** Vast.ai
- **Compute Region:** Unknown
- **Carbon Emitted:** ~4.79 kg CO2eq

# Glossary

**News Document**, plain text form of a news article or news item.  
**News Item**, aka news article. A particular piece of news, usually from a journalistic source.  
**Snippet**, a small section of text that is related to a news document.  
**Title** aka headline. A few words that reflect the essence of the news story.  
**Teaser** aka lede. A few sentences that spark curiosity about the "best of the rest" of the news story.