Update README.md

---
pipeline_tag: text-generation
tags:
- text-generation-inference
- backpack
- backpackmodel
library_name: transformers
license: apache-2.0
datasets:
- openwebtext
language:
- en
---

# Model Card for Backpack-GPT2

<!-- Provide a quick summary of what the model is/does. [Optional] -->
The Backpack-GPT2 language model is an instance of the [Backpack architecture](https://arxiv.org/abs/2305.16765), intended to combine strong modeling performance with an interface for interpretability and control.
Most details about this model and its training can be found in the paper [Backpack Language Models](https://arxiv.org/abs/2305.16765).

See also [backpackmodels.science](https://backpackmodels.science).

# Table of Contents

- [Model Card for Backpack-GPT2](#model-card-for-backpack-gpt2)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
- [Environmental Impact](#environmental-impact)
- [Technical Specifications](#technical-specifications)
  - [Model Architecture and Objective](#model-architecture-and-objective)
  - [Compute Infrastructure](#compute-infrastructure)
    - [Hardware](#hardware)
    - [Software](#software)
- [Citation](#citation)
- [Model Card Authors](#model-card-authors)
- [Model Card Contact](#model-card-contact)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)

# Model Details

## Model Description

<!-- Provide a longer summary of what this model is/does. -->
Backpack-GPT2 is a [Backpack-based language model](https://arxiv.org/abs/2305.16765), an architecture intended to combine strong modeling performance with an interface for interpretability and control.

- **Developed by:** John Hewitt, John Thickstun, Christopher D. Manning, Percy Liang
- **Shared by [Optional]:** More information needed
- **Model type:** Language model
- **Language(s) (NLP):** en
- **License:** apache-2.0
- **Resources for more information:**
  - [GitHub Repo](https://github.com/john-hewitt/backpacks-flash-attn)
  - [Associated Paper](https://arxiv.org/abs/2305.16765)

# Uses

This model is intended for use in the study and development of increasingly interpretable methods in natural language processing.
It is not directly fit for any production use.

# Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
This model in particular is limited in its capabilities, and because its architecture is brand new, less is known about its biases than about those of, e.g., Transformer-based models.

# Training Details

## Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

This model was trained on the [OpenWebText](https://huggingface.co/datasets/openwebtext) corpus.
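
To take a quick look at the corpus, the snippet below is a sketch using the Hugging Face `datasets` library; it is for inspection only and is not the authors' preprocessing pipeline. Depending on your `datasets` version, loading this dataset may additionally require `trust_remote_code=True`, and the full download is large.

```python
# Sketch: peek at OpenWebText with the `datasets` library (inspection only).
from datasets import load_dataset

openwebtext = load_dataset("openwebtext", split="train")  # large download
print(openwebtext[0]["text"][:200])                        # first 200 characters of one document
```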

## Training Procedure

This model was trained for 100k gradient steps with a batch size of 512k tokens, using a learning rate that warms up linearly over the first 5k steps to a peak of 6e-4 and then decays linearly to zero over the remaining steps.
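
Read concretely, that schedule looks like the following sketch (plain Python, illustrative only; the actual training code lives in the GitHub repository linked above).

```python
# Illustrative reading of the reported schedule: linear warmup to 6e-4 over the
# first 5k steps, then linear decay to zero at step 100k.
def backpack_lr(step: int, peak: float = 6e-4, warmup: int = 5_000, total: int = 100_000) -> float:
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))

for step in (0, 2_500, 5_000, 52_500, 100_000):
    print(step, backpack_lr(step))  # 0.0, 3e-4, 6e-4, 3e-4, 0.0
```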

# Environmental Impact

- **Hardware Type:** 4 A100 GPUs (40GB)
- **Hours used:** Roughly 4 days (~96 hours).
- **Cloud Provider:** Stanford compute.
- **Compute Region:** Stanford energy grid.

# Technical Specifications

## Model Architecture and Objective

This model is a [Backpack language model](https://arxiv.org/pdf/2305.16765.pdf) trained to minimize the standard next-token cross-entropy loss.
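
For reference, here is a minimal sketch of that objective in PyTorch; the shapes and vocabulary size are illustrative stand-ins, not values read from the model configuration.

```python
# Sketch of the next-token cross-entropy objective (illustrative shapes).
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 16, 50257            # GPT-2-style vocabulary size
logits = torch.randn(batch, seq_len, vocab)     # stand-in for model outputs
input_ids = torch.randint(0, vocab, (batch, seq_len))

# Position t predicts token t+1: drop the last logit, shift the labels, average.
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab),
    input_ids[:, 1:].reshape(-1),
)
print(loss.item())
```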

## Compute Infrastructure

This model was trained on a Slurm cluster.

### Hardware

This model was trained on 4 A100 (40GB) GPUs.

### Software

This model was trained with [FlashAttention](https://github.com/HazyResearch/flash-attention) and [PyTorch](https://pytorch.org/).

# Citation

**BibTeX:**

```
@InProceedings{hewitt2023backpack,
  author    = "Hewitt, John and Thickstun, John and Manning, Christopher D. and Liang, Percy",
  title     = "Backpack Language Models",
  booktitle = "Proceedings of the Association for Computational Linguistics",
  year      = "2023",
  publisher = "Association for Computational Linguistics",
  location  = "Toronto, Canada",
}
```

# Model Card Authors

<!-- This section provides another layer of transparency and accountability. Whose views is this model card representing? How many voices were included in its construction? Etc. -->

John Hewitt

# Model Card Contact

More information needed

# How to Get Started with the Model

<details>
<summary> Click to expand </summary>

More information needed
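
In the meantime, the sketch below shows one plausible way to load the model with `transformers`. The repository id, the use of the stock GPT-2 tokenizer, and the reliance on `trust_remote_code=True` for the custom Backpack classes are assumptions, not confirmed instructions from the authors.

```python
# Loading sketch; the repo id and trust_remote_code usage are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stanfordnlp/backpack-gpt2"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Backpack-GPT2 uses the GPT-2 tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

inputs = tokenizer("The Backpack architecture", return_tensors="pt")
with torch.no_grad():
    outputs = model(inputs["input_ids"])
print(outputs.logits.shape)  # (batch, sequence, vocab), assuming a standard causal-LM output
```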

</details>