---
license: other
license_name: ntt-license
license_link: LICENSE
language:
- ja
- en
pipeline_tag: translation
library_name: fairseq
tags:
- nmt
---

# Sugoi v4 JPN->ENG NMT Model by MingShiba

- https://sugoitoolkit.com
- https://blog.sugoitoolkit.com
- https://www.patreon.com/mingshiba

## How to download this model using Python
- Install Python: https://www.python.org/downloads/
- Open a command prompt (`cmd`)
- Confirm Python is on your path: `python --version`
- Install the Hugging Face Hub library: `python -m pip install huggingface_hub`
- Start the Python interpreter: `python`

```
import huggingface_hub
huggingface_hub.snapshot_download('entai2965/sugoi-v4-ja-en-ctranslate2', local_dir='sugoi-v4-ja-en-ctranslate2')
```

## How to run this model (batch syntax)
- Reference: https://opennmt.net/CTranslate2/guides/fairseq.html#fairseq
- Open a command prompt (`cmd`)
- Install the runtime dependencies: `python -m pip install ctranslate2 sentencepiece`
- Start the Python interpreter: `python`
```
import ctranslate2
import sentencepiece

# set defaults
model_path = 'sugoi-v4-ja-en-ctranslate2'
sentencepiece_model_path = model_path + '/spm'

device = 'cpu'
# device = 'cuda'  # use this instead if a CUDA-capable GPU is available

# load data
string1 = 'γ―ι™γ‹γ«ε‰γΈγ¨ζ­©γΏε‡ΊγŸγ€‚'
string2 = '悲しいGPTγ¨θ©±γ—γŸγ“γ¨γŒγ‚γ‚ŠγΎγ™γ‹?'
raw_list = [string1, string2]

# load models
translator = ctranslate2.Translator(model_path, device=device)
tokenizer_for_source_language = sentencepiece.SentencePieceProcessor(model_file=sentencepiece_model_path + '/spm.ja.nopretok.model')
tokenizer_for_target_language = sentencepiece.SentencePieceProcessor(model_file=sentencepiece_model_path + '/spm.en.nopretok.model')

# tokenize batch
tokenized_batch = []
for text in raw_list:
    tokenized_batch.append(tokenizer_for_source_language.encode(text, out_type=str))

# translate
# https://opennmt.net/CTranslate2/python/ctranslate2.Translator.html#ctranslate2.Translator.translate_batch
translated_batch = translator.translate_batch(source=tokenized_batch, beam_size=5)
assert len(raw_list) == len(translated_batch)

# decode
for count, tokens in enumerate(translated_batch):
    translated_batch[count] = tokenizer_for_target_language.decode(tokens.hypotheses[0]).replace('<unk>', '')

# output
for text in translated_batch:
    print(text)
```

A [functional programming](https://docs.python.org/3/howto/functional.html) version of the same pipeline:

```
import ctranslate2
import sentencepiece

# set defaults
model_path = 'sugoi-v4-ja-en-ctranslate2'
sentencepiece_model_path = model_path + '/spm'

device = 'cpu'
# device = 'cuda'  # use this instead if a CUDA-capable GPU is available

# load data
string1 = 'γ―ι™γ‹γ«ε‰γΈγ¨ζ­©γΏε‡ΊγŸγ€‚'
string2 = '悲しいGPTγ¨θ©±γ—γŸγ“γ¨γŒγ‚γ‚ŠγΎγ™γ‹?'
raw_list = [string1, string2]

# load models
translator = ctranslate2.Translator(model_path, device=device)
tokenizer_for_source_language = sentencepiece.SentencePieceProcessor(model_file=sentencepiece_model_path + '/spm.ja.nopretok.model')
tokenizer_for_target_language = sentencepiece.SentencePieceProcessor(model_file=sentencepiece_model_path + '/spm.en.nopretok.model')

# invoke black magic: tokenize, translate, and decode in one expression
translated_batch = [
    tokenizer_for_target_language.decode(tokens.hypotheses[0]).replace('<unk>', '')
    for tokens in translator.translate_batch(
        source=[tokenizer_for_source_language.encode(text, out_type=str) for text in raw_list],
        beam_size=5)
]
assert len(raw_list) == len(translated_batch)

# output
for text in translated_batch:
    print(text)
```
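The one-liner above composes the same three stages as the batch version: encode each string, translate the batch, decode each hypothesis. As a minimal plain-Python sketch of that composition (stub functions standing in for the real SentencePiece tokenizers and the CTranslate2 translator, so no model download is needed):

```python
# Hypothetical stubs illustrating the encode -> translate -> decode shape.
def encode(text):            # stands in for SentencePieceProcessor.encode
    return text.split()

def translate_batch(batch):  # stands in for Translator.translate_batch
    return [[token.upper() for token in tokens] for tokens in batch]

def decode(tokens):          # stands in for SentencePieceProcessor.decode
    return ' '.join(tokens)

raw_list = ['hello world', 'good morning']

# the same nested-comprehension pattern as the functional version above
translated = [decode(tokens) for tokens in translate_batch([encode(t) for t in raw_list])]
print(translated)  # ['HELLO WORLD', 'GOOD MORNING']
```

The inner comprehension tokenizes every string before `translate_batch` runs once over the whole batch; the outer comprehension then decodes each result in order, which is why the input and output lists always have the same length.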