---
license: other
license_name: ntt-license
license_link: LICENSE
language:
- ja
- en
pipeline_tag: translation
library_name: fairseq
tags:
- nmt
---
# Sugoi v4 JPN->ENG NMT Model by MingShiba
- https://sugoitoolkit.com
- https://blog.sugoitoolkit.com
- https://www.patreon.com/mingshiba
## How to download this model using Python
- Install Python: https://www.python.org/downloads/
- Open a command prompt: `cmd`
- Verify the installation: `python --version`
- Install the Hugging Face client library: `python -m pip install huggingface_hub`
- Start the interpreter: `python`
```python
import huggingface_hub
huggingface_hub.snapshot_download(repo_id='entai2965/sugoi-v4-ja-en-ctranslate2', local_dir='sugoi-v4-ja-en-ctranslate2')
```
## How to run this model (batch syntax)
- Reference: https://opennmt.net/CTranslate2/guides/fairseq.html#fairseq
- Open a command prompt: `cmd`
- Install the runtime dependencies: `python -m pip install ctranslate2 sentencepiece`
- Start the interpreter: `python`
```python
import ctranslate2
import sentencepiece

# Set defaults.
model_path = 'sugoi-v4-ja-en-ctranslate2'
sentencepiece_model_path = model_path + '/spm'
device = 'cpu'
# device = 'cuda'

# Load data.
string1 = 'は静かに前へと歩み出す。'
string2 = '悲しいGPTと話したことありますか?'
raw_list = [string1, string2]

# Load models.
translator = ctranslate2.Translator(model_path, device=device)
tokenizer_for_source_language = sentencepiece.SentencePieceProcessor(sentencepiece_model_path + '/spm.ja.nopretok.model')
tokenizer_for_target_language = sentencepiece.SentencePieceProcessor(sentencepiece_model_path + '/spm.en.nopretok.model')

# Tokenize the batch.
tokenized_batch = []
for text in raw_list:
    tokenized_batch.append(tokenizer_for_source_language.encode(text, out_type=str))

# Translate.
# https://opennmt.net/CTranslate2/python/ctranslate2.Translator.html?#ctranslate2.Translator.translate_batch
translated_batch = translator.translate_batch(source=tokenized_batch, beam_size=5)
assert len(raw_list) == len(translated_batch)

# Decode, stripping any stray <unk> tokens.
for count, tokens in enumerate(translated_batch):
    translated_batch[count] = tokenizer_for_target_language.decode(tokens.hypotheses[0]).replace('<unk>', '')

# Output.
for text in translated_batch:
    print(text)
```
[Functional programming](https://docs.python.org/3/howto/functional.html) version
```python
import ctranslate2
import sentencepiece

# Set defaults.
model_path = 'sugoi-v4-ja-en-ctranslate2'
sentencepiece_model_path = model_path + '/spm'
device = 'cpu'
# device = 'cuda'

# Load data.
string1 = 'は静かに前へと歩み出す。'
string2 = '悲しいGPTと話したことありますか?'
raw_list = [string1, string2]

# Load models.
translator = ctranslate2.Translator(model_path, device=device)
tokenizer_for_source_language = sentencepiece.SentencePieceProcessor(sentencepiece_model_path + '/spm.ja.nopretok.model')
tokenizer_for_target_language = sentencepiece.SentencePieceProcessor(sentencepiece_model_path + '/spm.en.nopretok.model')

# Invoke black magic: tokenize, translate, and decode in one expression.
translated_batch = [
    tokenizer_for_target_language.decode(tokens.hypotheses[0]).replace('<unk>', '')
    for tokens in translator.translate_batch(
        source=[tokenizer_for_source_language.encode(text, out_type=str) for text in raw_list],
        beam_size=5)
]
assert len(raw_list) == len(translated_batch)

# Output.
for text in translated_batch:
    print(text)
```