sho-takase committed on
Commit
a08125f
1 Parent(s): 12fb7ec

Update README.md

  1. README.md +63 -63
README.md CHANGED
@@ -1,64 +1,64 @@
- ---
- license: mit
- language:
- - ja
- ---
-
- # Sarashina1-13B
-
- This repository provides Japanese language models trained by [SB Intuitions](https://www.sbintuitions.co.jp/).
-
-
- ## How to use
-
- ```
- import torch
- from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed
-
- model = AutoModelForCausalLM.from_pretrained("sbintuitions/sarashina1-13b", torch_dtype=torch.float16)
- tokenizer = AutoTokenizer.from_pretrained("sbintuitions/sarashina1-13b", use_fast=False)
- generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device_map="auto")
- set_seed(123)
-
- text = generator(
-     "おはようございます、今日の天気は",
-     max_length=30,
-     do_sample=True,
-     pad_token_id=tokenizer.pad_token_id,
-     num_return_sequences=3,
- )
-
- for t in text:
-     print(t)
-
- ```
-
- ## Configuration
-
- | Parameters | Vocab size | Trainning tokens | Architecture | Position type | Layers | Hidden dim | Attention heads |
- | :-----: | :-----------: | :-------------: | :----------- | :-----------: | :----: | :--------: | :-------------: |
- | [7B](https://huggingface.co/sbintuitions/sarashina1-7b) | 51200 | 1.0T | GPTNeoX | RoPE | 32 | 4096 | 32 |
- | [13B](https://huggingface.co/sbintuitions/sarashina1-13b) | 51200 | 1.0T | GPTNeoX | RoPE | 40 | 5120 | 40 |
- | [65B](https://huggingface.co/sbintuitions/sarashina1-65b) | 51200 | 800B | GPTNeoX | RoPE | 80 | 8192 | 64 |
-
- ## Training Corpus
-
- We used a Japanese portion of the [Common Crawl corpus](https://commoncrawl.org/), which is the largest Web corpus, as our training dataset.
- To clean the training corpus, we used [CCNet](https://github.com/facebookresearch/cc_net) and [HojiChar](https://github.com/HojiChar/HojiChar).
- After cleaning, our corpus contains about 550B tokens.
-
- ## Tokenization
-
- We use a [sentencepiece](https://github.com/google/sentencepiece) tokenizer with a unigram language model and byte-fallback.
- We do not apply pre-tokenization with Japanese tokenizer.
- Thus, a user may directly feed raw sentences into the tokenizer.
-
-
- ## Ethical Considerations and Limitations
- Sarashina1 has not been tuned to follow an instruction yet.
- Therefore, sarashina1 might generate some meaningless sequences, some inaccurate instances or biased/objectionable outputs.
- Before using sarashina1, we would like developers to tune models based on human preferences and safety considerations.
-
- ## License
-
+ ---
+ license: mit
+ language:
+ - ja
+ ---
+
+ # Sarashina1-13B
+
+ This repository provides Japanese language models trained by [SB Intuitions](https://www.sbintuitions.co.jp/).
+
+
+ ## How to use
+
+ ```
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed
+
+ model = AutoModelForCausalLM.from_pretrained("sbintuitions/sarashina1-13b", torch_dtype=torch.float16, device_map="auto")
+ tokenizer = AutoTokenizer.from_pretrained("sbintuitions/sarashina1-13b", use_fast=False)
+ generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
+ set_seed(123)
+
+ text = generator(
+     "おはようございます、今日の天気は",
+     max_length=30,
+     do_sample=True,
+     pad_token_id=tokenizer.pad_token_id,
+     num_return_sequences=3,
+ )
+
+ for t in text:
+     print(t)
+
+ ```
+
+ ## Configuration
+
+ | Parameters | Vocab size | Trainning tokens | Architecture | Position type | Layers | Hidden dim | Attention heads |
+ | :-----: | :-----------: | :-------------: | :----------- | :-----------: | :----: | :--------: | :-------------: |
+ | [7B](https://huggingface.co/sbintuitions/sarashina1-7b) | 51200 | 1.0T | GPTNeoX | RoPE | 32 | 4096 | 32 |
+ | [13B](https://huggingface.co/sbintuitions/sarashina1-13b) | 51200 | 1.0T | GPTNeoX | RoPE | 40 | 5120 | 40 |
+ | [65B](https://huggingface.co/sbintuitions/sarashina1-65b) | 51200 | 800B | GPTNeoX | RoPE | 80 | 8192 | 64 |
+
+ ## Training Corpus
+
+ We used a Japanese portion of the [Common Crawl corpus](https://commoncrawl.org/), which is the largest Web corpus, as our training dataset.
+ To clean the training corpus, we used [CCNet](https://github.com/facebookresearch/cc_net) and [HojiChar](https://github.com/HojiChar/HojiChar).
+ After cleaning, our corpus contains about 550B tokens.
+
+ ## Tokenization
+
+ We use a [sentencepiece](https://github.com/google/sentencepiece) tokenizer with a unigram language model and byte-fallback.
+ We do not apply pre-tokenization with Japanese tokenizer.
+ Thus, a user may directly feed raw sentences into the tokenizer.
+
+
+ ## Ethical Considerations and Limitations
+ Sarashina1 has not been tuned to follow an instruction yet.
+ Therefore, sarashina1 might generate some meaningless sequences, some inaccurate instances or biased/objectionable outputs.
+ Before using sarashina1, we would like developers to tune models based on human preferences and safety considerations.
+
+ ## License
+
  [MIT License](https://huggingface.co/sbintuitions/sarashina1-13b/blob/main/LICENSE)
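
The only functional change in this diff is that `device_map="auto"` moves from the `pipeline(...)` call to `AutoModelForCausalLM.from_pretrained(...)`. The commit does not state a reason, but a rough size estimate from the Configuration table suggests why placement at load time could matter for models this large. The sketch below is a back-of-the-envelope calculation, not the authors' method. It assumes a standard decoder-only Transformer with a 4x MLP expansion and untied input/output embeddings, and it ignores biases and layer norms. The names `Config`, `approx_params`, and `fp16_gib` are hypothetical helpers introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One row of the Configuration table (only the fields used here)."""
    name: str
    vocab_size: int
    layers: int
    hidden_dim: int


def approx_params(cfg: Config) -> int:
    """Rough parameter count for a decoder-only Transformer.

    Assumptions (not stated in the README): 4x MLP expansion, untied
    input and output embeddings, biases and layer norms ignored.
    """
    # Per layer: 4*d^2 for attention (Q, K, V, output) + 8*d^2 for the MLP.
    per_layer = 12 * cfg.hidden_dim ** 2
    # Input embedding plus a separate output projection.
    embeddings = 2 * cfg.vocab_size * cfg.hidden_dim
    return cfg.layers * per_layer + embeddings


def fp16_gib(n_params: int) -> float:
    """Memory for the weights alone at 2 bytes per parameter, in GiB."""
    return n_params * 2 / 2**30


# Values copied from the Configuration table in the README.
CONFIGS = [
    Config("7B", 51200, 32, 4096),
    Config("13B", 51200, 40, 5120),
    Config("65B", 51200, 80, 8192),
]

for cfg in CONFIGS:
    n = approx_params(cfg)
    print(f"{cfg.name}: ~{n / 1e9:.1f}B parameters, "
          f"~{fp16_gib(n):.0f} GiB of weights in float16")
```

Under these assumptions the estimates land close to the nominal model sizes, and the 13B weights alone come to roughly 24 GiB in `float16`, which is more than many single GPUs hold. That is consistent with letting `from_pretrained` distribute the weights across available devices as they are loaded, though the actual footprint also depends on activations and the generation cache.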