sho-takase committed on
Commit 0268beb
1 Parent(s): 68c91cb

Update README.md

Files changed (1)
  1. README.md +66 -66
README.md CHANGED
@@ -1,67 +1,67 @@
+ ---
+ license: mit
+ language:
+ - ja
+ - en
+ ---
+
+ # Sarashina2-13B
+
+ This repository provides large language models trained by [SB Intuitions](https://www.sbintuitions.co.jp/).
+
+
+ ## How to use
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed
+
- model = AutoModelForCausalLM.from_pretrained("sbintuitions/sarashina2-13b", torch_dtype=torch.bfloat16)
+ model = AutoModelForCausalLM.from_pretrained("sbintuitions/sarashina2-13b", torch_dtype=torch.bfloat16, device_map="auto")
+ tokenizer = AutoTokenizer.from_pretrained("sbintuitions/sarashina2-13b", use_fast=False)
- generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device_map="auto")
+ generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
+ set_seed(123)
+
+ text = generator(
+     "おはようございます、今日の天気は",
+     max_length=30,
+     do_sample=True,
+     pad_token_id=tokenizer.pad_token_id,
+     num_return_sequences=3,
+ )
+
+ for t in text:
+     print(t)
+
+ ```
+
+ ## Configuration
+
+ | Parameters | Vocab size | Training tokens | Architecture | Position type | Layers | Hidden dim | Attention heads |
+ | :-----: | :-----------: | :-------------: | :------------ | :-----------: | :----: | :--------: | :-------------: |
+ | [7B](https://huggingface.co/sbintuitions/sarashina2-7b) | 102400 | 2.1T | Llama2 | RoPE | 32 | 4096 | 32 |
+ | [13B](https://huggingface.co/sbintuitions/sarashina2-13b) | 102400 | 2.1T | Llama2 | RoPE | 40 | 5120 | 40 |
+ | 70B (TBA) | | | | | | | |
+
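To cross-check these values for a released checkpoint, the hyperparameters can be read from the model config. A minimal sketch using the standard `transformers.AutoConfig` API (attribute names follow the Llama-style config implied by the table):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("sbintuitions/sarashina2-13b")

# Fields corresponding to the table above (expected for 13B: 102400 / 40 / 5120 / 40).
print(config.vocab_size)
print(config.num_hidden_layers)
print(config.hidden_size)
print(config.num_attention_heads)
```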
+ ## Training Corpus
+
+ For our Japanese training data, we used the Japanese portion of the [Common Crawl corpus](https://commoncrawl.org/), the largest web corpus available.
+ To clean the training corpus, we used [CCNet](https://github.com/facebookresearch/cc_net) and [HojiChar](https://github.com/HojiChar/HojiChar).
+ After cleaning, our Japanese training data contains about 1T tokens.
+
+ For our English training data, we extracted English documents from [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B), but we removed the books3 corpus because of copyright concerns.
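A minimal sketch of this kind of subset filtering is shown below. It is illustrative only: the streaming `load_dataset`/`filter` calls are standard `datasets` APIs, but the metadata key `redpajama_set_name` and the label `RedPajamaBook` used to identify books3 documents are assumptions about the SlimPajama schema, not details from this README.

```python
from datasets import load_dataset

# Stream SlimPajama and drop documents from the book subset.
# ASSUMPTION: each example carries meta["redpajama_set_name"], with books3
# documents labelled "RedPajamaBook"; verify against the dataset card.
ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)
english_wo_books = ds.filter(
    lambda ex: ex["meta"]["redpajama_set_name"] != "RedPajamaBook"
)

for example in english_wo_books.take(3):
    print(example["text"][:80])
```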
+
+ ## Tokenization
+
+ We use a [sentencepiece](https://github.com/google/sentencepiece) tokenizer with a unigram language model and byte-fallback.
+ We do not apply pre-tokenization with a Japanese tokenizer.
+ Thus, users can feed raw sentences directly into the tokenizer.
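As a small illustration (reusing the model ID from the usage example above; `tokenize`, `__call__`, and `decode` are standard `transformers` tokenizer methods), a raw Japanese sentence can be passed in without any prior word segmentation:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sbintuitions/sarashina2-13b", use_fast=False)

# Raw text goes straight into the tokenizer; no Japanese word segmentation is applied beforehand.
text = "おはようございます、今日の天気は"
print(tokenizer.tokenize(text))                          # unigram subword pieces
ids = tokenizer(text)["input_ids"]
print(tokenizer.decode(ids, skip_special_tokens=True))   # should recover the original sentence
```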
+
+
+ ## Ethical Considerations and Limitations
+ Sarashina2 has not been tuned to follow instructions yet.
+ Therefore, Sarashina2 may generate meaningless sequences, inaccurate instances, or biased/objectionable outputs.
+ Before using Sarashina2, we would like developers to tune the models based on human preferences and safety considerations.
+
+ ## License
+
  [MIT License](https://huggingface.co/sbintuitions/sarashina2-7b/blob/main/LICENSE)