crumb committed
Commit 65bf5c4 · Parent(s): afefcd8

Update README.md

Files changed (1):
  1. README.md +16 -18
README.md CHANGED
@@ -7,37 +7,35 @@ sdk: static
  pinned: false
  ---

- ## Dante-{small,medium,large}
+ ## Dante Models: {Small, Medium, Large}

- Dante is a family of three [Mistral](https://huggingface.co/mistralai/Mistral-7B-v0.1)-derivative decoder-only transformer models that drop the middle layers of the original Mistral-7B model. The ranges of layers dropped for each model are `[15:-8]`, `[10:-3]`, and `[2:-2]`, corresponding to large, medium, and small.
+ Dante comprises three decoder-only transformer models derived from [Mistral](https://huggingface.co/mistralai/Mistral-7B-v0.1), with the middle layers of the original Mistral-7B model dropped: `[15:-8]`, `[10:-3]`, and `[2:-2]` for large, medium, and small respectively.

  ![](graphic.png)

- | Model name | Parameters | Layers kept from original Mistral |
+ | Model | Parameters | Retained Layers |
  | --- | --- | --- |
- | [crumbly/dante-large](https://hf.co/crumbly/dante-large) | 5.1B | 23/32 |
- | [crumbly/dante-medium](https://hf.co/crumbly/dante-medium) | 3B | 13/32 |
- | [crumbly/dante-small](https://hf.co/crumbly/dante-small) | 1B | 4/32 |
+ | [Dante-Large](https://hf.co/crumbly/dante-large) | 5.1B | 23/32 |
+ | [Dante-Medium](https://hf.co/crumbly/dante-medium) | 3B | 13/32 |
+ | [Dante-Small](https://hf.co/crumbly/dante-small) | 1B | 4/32 |

- The models were then fine-tuned with high-rank adapters in nf4 precision on a randomized small subset of a dataset of high-quality web documents, to 'set' the weights in place and allow the model to generate coherent text.
+ The models were fine-tuned with high-rank adapters in nf4 precision on a small randomized subset of high-quality web documents, to 'set' the weights in place and restore coherent text generation.

  ## Virgil Dataset

- Crumbly's Virgil dataset is a dataset of high-quality, up-to-date English text documents and code for 'setting' architectural changes in warm-started or pretrained models *(e.g. Dante)*. Utilizing a warm-started or pretrained model's weights as parts of a new model is a cheap and efficient way to create a new model with the world-knowledge of the previous model, rather than pretraining from scratch. Most weights can be utilized by most types of Transformers in some capacity, if adapted to utilize them *(frozen or unfrozen)*, and high-quality documents are a *must* to preserve the skills of the donor model.
+ Crumbly's Virgil dataset consists of up-to-date English text and code for 'setting' architectural changes in warm-started or pretrained models like Dante; reusing a pretrained model's weights is a cheap, efficient way to give a new model the donor model's world-knowledge rather than pretraining from scratch.

- The approximate distribution of the Virgil dataset is as follows.
-
- | subset | approximate % of tokens |
+ | Subset | Approx. % of tokens |
  | --- | --- |
- | papers | 21.65% |
- | github | 35.34% |
- | books | 23.08% |
- | wiki | 3.56% |
- | webtext | 16.36% |
+ | Papers | 21.65% |
+ | GitHub | 35.34% |
+ | Books | 23.08% |
+ | Wiki | 3.56% |
+ | Webtext | 16.36% |

- **Bias**: This dataset contains text scraped from the internet, including erotic content and harmful stereotypes; a note is to be made about mitigating these kinds of completions at inference time.
+ **Bias Alert**: This dataset contains text scraped from the internet, including erotic content and harmful stereotypes. Measures should be taken to mitigate these kinds of completions at inference time.

- Only random 1k token windows of a small randomized 2% subset of Virgil are used for setting the Dante models, as multiple billions of tokens would take a very long time to train into any transformer model, especially on consumer graphics cards. (Crumbly's compute is a single 2xA6000 Lambdalabs Vector Workstation. Highly recommend.) The dataset is not shared publicly.
+ Only random 1k-token windows from a small randomized 2% subset of Virgil are used to set the Dante models, as training billions of tokens into a transformer would take a very long time on Crumbly's compute (a single 2xA6000 Lambdalabs Vector Workstation). The dataset is not publicly shared.
 
  ---
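
The retained-layer counts in the updated table follow directly from the drop ranges. A minimal sketch of the arithmetic, assuming standard Python slice semantics over Mistral-7B's 32 decoder layers (the helper below is illustrative, not taken from the Dante code):

```python
# Deleting layers[start:stop] from a 32-layer stack should leave
# 23, 13, and 4 layers for large, medium, and small respectively.
N_LAYERS = 32  # decoder layers in Mistral-7B

def layers_kept(start: int, stop: int) -> list[int]:
    """Indices of the original layers that survive `del layers[start:stop]`."""
    layers = list(range(N_LAYERS))
    del layers[start:stop]
    return layers

for name, (start, stop) in {
    "large": (15, -8),   # drops indices 15..23 (9 layers)  -> 23/32 kept
    "medium": (10, -3),  # drops indices 10..28 (19 layers) -> 13/32 kept
    "small": (2, -2),    # drops indices 2..29 (28 layers)  -> 4/32 kept
}.items():
    print(f"dante-{name}: {len(layers_kept(start, stop))}/{N_LAYERS} layers kept")
```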
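The "high-rank adapters in nf4 precision" step maps onto a QLoRA-style recipe. A minimal sketch assuming the Hugging Face transformers, peft, and bitsandbytes libraries; the adapter rank and target modules are assumptions, since the README does not specify them:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the layer-dropped model with nf4 (4-bit NormalFloat) weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "crumbly/dante-large", quantization_config=bnb_config
)

# Attach high-rank LoRA adapters; r=256 is an assumed "high rank",
# and the attention projections are assumed target modules.
lora_config = LoraConfig(
    r=256,
    lora_alpha=256,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```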
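Likewise, the "random 1k token windows" can be pictured as fixed-length slices drawn at random offsets. A hypothetical helper (a 1024-token window is an assumption for "1k"; nothing here is from the Virgil tooling):

```python
import random

def sample_windows(token_ids: list[int], window: int = 1024, n_windows: int = 4) -> list[list[int]]:
    """Draw fixed-length windows of token ids at random offsets within one document."""
    max_start = len(token_ids) - window
    if max_start < 0:
        return []  # document is shorter than a single window
    starts = [random.randrange(max_start + 1) for _ in range(n_windows)]
    return [token_ids[s:s + window] for s in starts]
```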