---
license: llama2
language:
- en
library_name: CTranslate2
pipeline_tag: text-generation
tags:
- facebook
- meta
- wizardlm
- llama
- llama-2
- ct2
- quantized model
- int8
---

# CTranslate2 int8 version of WizardLM-13B-V1.2

This is an int8_float16 quantization of [WizardLM-13B-V1.2](https://huggingface.co/WizardLM/WizardLM-13B-V1.2)\
See more on CTranslate2: [Docs](https://opennmt.net/CTranslate2/index.html) | [Github](https://github.com/OpenNMT/CTranslate2)

This model was converted to ct2 format using the following command:

```
ct2-transformers-converter --model WizardLM/WizardLM-13B-V1.2 --copy_files tokenizer.model --output_dir wizard13b --quantization int8_float16 --low_cpu_mem_usage
```

To convert this model, the file **added_tokens.json** had to be edited.

From:

```
{
  "": 32000
}
```

To:

```
{
}
```

***No conversion is needed when using the model from this repository, as it is already in ct2 format.***

---

## From the CTranslate2 GitHub (no relation to this model):

CTranslate2 is a C++ and Python library for efficient inference with Transformer models.

### CTranslate2 performance

We translate the En->De test set *newstest2014* with multiple models:

* [OpenNMT-tf WMT14](https://opennmt.net/Models-tf/#translation): a base Transformer trained with OpenNMT-tf on the WMT14 dataset (4.5M lines)
* [OpenNMT-py WMT14](https://opennmt.net/Models-py/#translation): a base Transformer trained with OpenNMT-py on the WMT14 dataset (4.5M lines)
* [OPUS-MT](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models/en-de#opus-2020-02-26zip): a base Transformer trained with Marian on all OPUS data available on 2020-02-26 (81.9M lines)

The benchmark reports the number of target tokens generated per second (higher is better). The results are aggregated over multiple runs. See the [benchmark scripts](tools/benchmark) for more details and to reproduce these numbers.
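As a rough illustration of the metric described above, the sketch below times a generation callable and averages target tokens per second over several runs. It is not the upstream benchmark script: the helper name `tokens_per_second`, the `fake_generate` stand-in, and the use of a simple mean as the aggregate are all assumptions made for this example.

```python
import statistics
import time
from typing import Callable, List


def tokens_per_second(generate: Callable[[], List[List[str]]], runs: int = 3) -> float:
    """Return the mean target-token throughput of `generate` over `runs` runs.

    `generate` is a hypothetical callable that returns one list of target
    tokens per input sentence. Averaging with a plain mean is an assumption;
    the real benchmark may aggregate differently.
    """
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        outputs = generate()
        elapsed = time.perf_counter() - start
        num_tokens = sum(len(tokens) for tokens in outputs)
        rates.append(num_tokens / elapsed)
    return statistics.mean(rates)


def fake_generate() -> List[List[str]]:
    """Stand-in for a real translation call: sleeps, then returns 5 tokens."""
    time.sleep(0.01)
    return [["Hallo", "Welt"], ["Guten", "Morgen", "!"]]


if __name__ == "__main__":
    print(f"{tokens_per_second(fake_generate):.1f} tokens/s")
```

To measure a real model, `fake_generate` would be replaced by a function that runs the translation or generation call and returns the produced tokens.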
**Please note that the results presented below are only valid for the configuration used during this benchmark: absolute and relative performance may change with different settings.**

#### CPU

| | Tokens per second | Max. memory | BLEU |
| --- | --- | --- | --- |
| **OpenNMT-tf WMT14 model** | | | |
| OpenNMT-tf 2.31.0 (with TensorFlow 2.11.0) | 209.2 | 2653MB | 26.93 |
| **OpenNMT-py WMT14 model** | | | |
| OpenNMT-py 3.0.4 (with PyTorch 1.13.1) | 275.8 | 2012MB | 26.77 |
| - int8 | 323.3 | 1359MB | 26.72 |
| CTranslate2 3.6.0 | 658.8 | 849MB | 26.77 |
| - int16 | 733.0 | 672MB | 26.82 |
| - int8 | 860.2 | 529MB | 26.78 |
| - int8 + vmap | 1126.2 | 598MB | 26.64 |
| **OPUS-MT model** | | | |
| Transformers 4.26.1 (with PyTorch 1.13.1) | 147.3 | 2332MB | 27.90 |
| Marian 1.11.0 | 344.5 | 7605MB | 27.93 |
| - int16 | 330.2 | 5901MB | 27.65 |
| - int8 | 355.8 | 4763MB | 27.27 |
| CTranslate2 3.6.0 | 525.0 | 721MB | 27.92 |
| - int16 | 596.1 | 660MB | 27.53 |
| - int8 | 696.1 | 516MB | 27.65 |

Executed with 4 threads on a [*c5.2xlarge*](https://aws.amazon.com/ec2/instance-types/c5/) Amazon EC2 instance equipped with an Intel(R) Xeon(R) Platinum 8275CL CPU.

#### GPU

| | Tokens per second | Max. GPU memory | Max. CPU memory | BLEU |
| --- | --- | --- | --- | --- |
| **OpenNMT-tf WMT14 model** | | | | |
| OpenNMT-tf 2.31.0 (with TensorFlow 2.11.0) | 1483.5 | 3031MB | 3122MB | 26.94 |
| **OpenNMT-py WMT14 model** | | | | |
| OpenNMT-py 3.0.4 (with PyTorch 1.13.1) | 1795.2 | 2973MB | 3099MB | 26.77 |
| FasterTransformer 5.3 | 6979.0 | 2402MB | 1131MB | 26.77 |
| - float16 | 8592.5 | 1360MB | 1135MB | 26.80 |
| CTranslate2 3.6.0 | 6634.7 | 1261MB | 953MB | 26.77 |
| - int8 | 8567.2 | 1005MB | 807MB | 26.85 |
| - float16 | 10990.7 | 941MB | 807MB | 26.77 |
| - int8 + float16 | 8725.4 | 813MB | 800MB | 26.83 |
| **OPUS-MT model** | | | | |
| Transformers 4.26.1 (with PyTorch 1.13.1) | 1022.9 | 4097MB | 2109MB | 27.90 |
| Marian 1.11.0 | 3241.0 | 3381MB | 2156MB | 27.92 |
| - float16 | 3962.4 | 3239MB | 1976MB | 27.94 |
| CTranslate2 3.6.0 | 5876.4 | 1197MB | 754MB | 27.92 |
| - int8 | 7521.9 | 1005MB | 792MB | 27.79 |
| - float16 | 9296.7 | 909MB | 814MB | 27.90 |
| - int8 + float16 | 8362.7 | 813MB | 766MB | 27.90 |

Executed with CUDA 11 on a [*g5.xlarge*](https://aws.amazon.com/ec2/instance-types/g5/) Amazon EC2 instance equipped with an NVIDIA A10G GPU (driver version: 510.47.03).
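
The tables are easier to compare as ratios. The sketch below takes two rows from the CPU table (OpenNMT-py 3.0.4 and CTranslate2 3.6.0 with int8, both on the OpenNMT-py WMT14 model) and computes relative throughput and memory use. The helper name `relative` and the dictionary layout are illustrative choices, and the ratios hold only for the benchmark configuration reported above.

```python
from typing import Dict, Tuple

# Figures copied from the CPU table (OpenNMT-py WMT14 model).
baseline = {"tokens_per_s": 275.8, "memory_mb": 2012}  # OpenNMT-py 3.0.4
ct2_int8 = {"tokens_per_s": 860.2, "memory_mb": 529}   # CTranslate2 3.6.0, int8


def relative(base: Dict[str, float], candidate: Dict[str, float]) -> Tuple[float, float]:
    """Return (throughput ratio, memory ratio) of `candidate` versus `base`."""
    speedup = candidate["tokens_per_s"] / base["tokens_per_s"]
    memory_ratio = candidate["memory_mb"] / base["memory_mb"]
    return speedup, memory_ratio


if __name__ == "__main__":
    speedup, memory_ratio = relative(baseline, ct2_int8)
    print(f"throughput: {speedup:.2f}x, memory: {memory_ratio:.2f}x")
```

For these two rows, CTranslate2 with int8 reaches a little over three times the throughput of the OpenNMT-py baseline while using roughly a quarter of the memory.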