# LASER: calculation of sentence embeddings

Tool to calculate sentence embeddings for an arbitrary text file:
``` bash
./embed.sh INPUT-FILE OUTPUT-FILE [LANGUAGE]
```

The input is first tokenized, and then sentence embeddings are generated. If a `language` is specified, `embed.sh` will look for a language-specific LASER3 encoder using the format `{model_dir}/laser3-{language}.{version}.pt`. Otherwise it will default to LASER2, which covers the same 93 languages as [the original LASER encoder](https://arxiv.org/pdf/1812.10464.pdf).

**NOTE:** please set the model location (`model_dir` in `embed.sh`) before running. We recommend downloading the models from the NLLB release (see [here](/nllb/README.md)). Optionally you can also select the model version number for downloaded LASER3 models; this currently defaults to `1` (initial release).

## Output format

The embeddings are stored as float32 matrices in raw binary format. They can be read in Python with:
```
import numpy as np
dim = 1024
X = np.fromfile("my_embeddings.bin", dtype=np.float32, count=-1)
X.resize(X.shape[0] // dim, dim)
```
`X` is an N x 1024 matrix, where N is the number of lines in the text file.

## Examples

To encode an input text in any of the 93 languages supported by LASER2 (e.g. Afrikaans, English, French):
```
./embed.sh input_file output_file
```

To use a language-specific encoder (if available), for example for Wolof, Hausa, or Irish:
```
./embed.sh input_file output_file wol_Latn
./embed.sh input_file output_file hau_Latn
./embed.sh input_file output_file gle_Latn
```
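
As a sanity check of the output format described above, the read-back recipe can be round-tripped end to end. The sketch below (the file name and the randomly generated "embeddings" are purely illustrative, not produced by LASER) writes a matrix in the same raw float32 layout, reads it back with the snippet from the "Output format" section, and compares two rows with cosine similarity — a common way to score sentence pairs with LASER embeddings:

```python
import numpy as np

dim = 1024  # LASER2/LASER3 embedding dimension

# Illustrative stand-in for embed.sh output: two fake sentence
# embeddings written in the same raw float32 binary format.
rng = np.random.default_rng(0)
fake = rng.standard_normal((2, dim)).astype(np.float32)
fake.tofile("my_embeddings.bin")

# Read them back exactly as described in "Output format".
X = np.fromfile("my_embeddings.bin", dtype=np.float32, count=-1)
X.resize(X.shape[0] // dim, dim)

# Cosine similarity between the first two "sentences".
a, b = X[0], X[1]
cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(X.shape, cos)
```

Since the file is plain row-major float32 data with no header, the row count is recovered by dividing the total element count by `dim`; a size that is not a multiple of 1024 indicates a truncated or mismatched file.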