alexmourachko committed on
Commit
2209e43
1 Parent(s): 1bfa5cb

update readme

  1. README.md +118 -0
README.md CHANGED
@@ -1,3 +1,121 @@
1
  ---
2
  license: cc-by-nc-4.0
3
  ---
4
+
5
+ # SONAR
6
+ [[Paper]]()
7
+ [[Demo]](#usage)
8
+
9
+ We introduce SONAR, a new multilingual and multimodal fixed-size sentence embedding space. Our single **text encoder, covering 200 languages**, substantially outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks.
10
+
11
+ Speech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. Our encoders outperform existing speech encoders on similarity search tasks.
12
+ We also provide a **text decoder for 200 languages**, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations.
13
+
14
+ Our text-to-text results are competitive with the state-of-the-art NLLB 1B model, despite the fixed-size bottleneck representation. Our zero-shot speech-to-text translation results compare favorably with strong supervised baselines such as Whisper.
15
+
16
+
17
+ Model inference is supported by [fairseq2](https://github.com/facebookresearch/fairseq2).
18
+
19
+
20
+ ## Installing
21
+
22
+ See our GitHub [repo](https://reimagined-broccoli-941276ee.pages.github.io/nightly/installation/from_source_conda).
23
+
24
+ ## Usage
25
+ Compute text sentence embeddings:
26
+ ```python
27
+ from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
28
+ t2vec_model = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder",
29
+ tokenizer="text_sonar_basic_encoder")
30
+ sentences = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']
31
+ t2vec_model.predict(sentences, source_lang="eng_Latn").shape
32
+ # torch.Size([2, 1024])
33
+ ```
34
+
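The embeddings returned above are fixed-size vectors, so two sentences can be compared by the cosine of the angle between their vectors. The following is a minimal, dependency-free sketch of that comparison. The helper name `cosine_similarity` and the toy 4-dimensional vectors are illustrative assumptions, not part of the SONAR API; real SONAR embeddings have 1024 dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for real sentence embeddings (assumption: illustration only).
emb_a = [1.0, 0.0, 2.0, 0.0]
emb_b = [2.0, 0.0, 4.0, 0.0]  # same direction as emb_a
emb_c = [0.0, 3.0, 0.0, 1.0]  # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))  # close to 1: same direction
print(cosine_similarity(emb_a, emb_c))  # 0: no overlap
```

With real embeddings, each row of the tensor returned by `t2vec_model.predict(...)` could be converted with `.tolist()` before being passed in, or `torch.nn.functional.cosine_similarity` could be applied to the tensors directly.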
35
+ Translate with SONAR:
36
+ ```python
37
+ from sonar.inference_pipelines.text import TextToTextModelPipeline
38
+ t2t_model = TextToTextModelPipeline(encoder="text_sonar_basic_encoder",
39
+ decoder="text_sonar_basic_decoder",
40
+ tokenizer="text_sonar_basic_encoder") # tokenizer is attached to both encoder and decoder cards
41
+
42
+ sentences = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']
43
+ t2t_model.predict(sentences, source_lang="eng_Latn", target_lang="fra_Latn")
44
+ # ['Mon nom est SONAR.', "Je peux intégrer les phrases dans l'espace vectoriel."]
45
+ ```
46
+
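The `source_lang` and `target_lang` values in the examples (`eng_Latn`, `fra_Latn`) appear to follow a three-letter language code plus four-letter script pattern. The sketch below checks only that shape before a string is passed to a pipeline. The names `LANG_CODE_PATTERN`, `looks_like_lang_code`, and `split_lang_code` are hypothetical helpers, and passing the check does not mean SONAR supports the language.

```python
import re
from typing import Tuple

# Shape seen in the examples: three lowercase letters, an underscore,
# then a capitalized four-letter script name (e.g. "eng_Latn").
LANG_CODE_PATTERN = re.compile(r"[a-z]{3}_[A-Z][a-z]{3}")


def looks_like_lang_code(code: str) -> bool:
    """Return True if `code` has the language_Script shape (hypothetical helper)."""
    return LANG_CODE_PATTERN.fullmatch(code) is not None


def split_lang_code(code: str) -> Tuple[str, str]:
    """Split a well-formed code into its (language, script) parts."""
    if not looks_like_lang_code(code):
        raise ValueError(f"unexpected language code format: {code!r}")
    language, script = code.split("_")
    return language, script


print(looks_like_lang_code("eng_Latn"))
print(looks_like_lang_code("en"))
print(split_lang_code("fra_Latn"))
```

A check like this catches common slips such as passing `"en"` or `"eng"` early, before any model is loaded.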
47
+ Compute speech sentence embeddings:
48
+ ```python
49
+ import torch
50
+ from sonar.inference_pipelines.speech import SpeechToEmbeddingPipeline, SpeechInferenceParams
51
+
52
+ speech_embedding_dp_builder = SpeechToEmbeddingPipeline.load_from_name("sonar_speech_encoder_eng")
53
+
54
+ speech_ctx = SpeechInferenceParams(
55
+ data_file="..../test_fleurs_fra-eng.tsv",
56
+ audio_root_dir=".../audio_zips",
57
+ audio_path_index=2,
58
+ batch_size=4,
59
+ )
60
+
61
+ speech_embedding_dp = speech_embedding_dp_builder.build_pipeline(speech_ctx)
62
+ with torch.inference_mode():
63
+ speech_emb = next(iter(speech_embedding_dp))
64
+ speech_emb["audio"]["data"].sentence_embeddings
65
+ ```
66
+
67
+
68
+ Speech-to-text with SONAR:
69
+ ```python
70
+ import torch
71
+ from sonar.inference_pipelines.speech import SpeechToTextPipeline, SpeechInferenceParams
72
+
73
+ speech_to_text_dp_builder = SpeechToTextPipeline.load_from_name(encoder_name="sonar_speech_encoder_eng",
74
+ decoder_name="text_sonar_basic_decoder")
75
+
76
+ speech_ctx = SpeechInferenceParams(
77
+ data_file=".../test_fleurs_fra-eng.tsv",
78
+ audio_root_dir=".../audio_zips",
79
+ audio_path_index=2,
80
+ target_lang='fra_Latn',
81
+ batch_size=4,
82
+ )
83
+ speech_to_text_dp = speech_to_text_dp_builder.build_pipeline(speech_ctx)
84
+ with torch.inference_mode():
85
+ speech_text_translation = next(iter(speech_to_text_dp))
86
+ speech_text_translation
87
+ ```
88
+
89
+ Predicting [cross-lingual semantic similarity](https://github.com/facebookresearch/fairseq/tree/nllb/examples/nllb/human_XSTS_eval)
90
+ with BLASER-2 models:
91
+ ```python
92
+ import torch
93
+ from sonar.models.blaser.loader import load_blaser_model
94
+
95
+ blaser_ref = load_blaser_model("blaser_st2st_ref_v2_0").eval()
96
+ blaser_qe = load_blaser_model("blaser_st2st_qe_v2_0").eval()
97
+ # BLASER-2 is supposed to work with SONAR speech and text embeddings,
98
+ # but we didn't include their extraction in this snippet, to keep it simple.
99
+ emb = torch.ones([1, 1024])
100
+ print(blaser_ref(src=emb, ref=emb, mt=emb).item()) # 5.2552
101
+ print(blaser_qe(src=emb, mt=emb).item()) # 4.9819
102
+ ```
103
+
104
+ See the more complete demo notebooks:
105
+ * [sonar text2text similarity and translation](examples/sonar_text_demo.ipynb)
106
+ * [sonar speech2text and other data pipeline examples](examples/inference_pipelines.ipynb)
107
+
108
+
109
+ ## Model details
110
+
111
+ - **Developed by:** Paul-Ambroise Duquenne et al.
112
+ - **License:** CC-BY-NC 4.0
113
+ - **Cite as:**
114
+
115
+ @article{Duquenne:2023:sonar_arxiv,
116
+ author = {Paul-Ambroise Duquenne and Holger Schwenk and Benoit Sagot},
117
+ title = {{SONAR:} Sentence-Level Multimodal and Language-Agnostic Representations},
118
+ publisher = {arXiv},
119
+ year = {2023},
120
+ url = {https://arxiv.org/abs/unk},
121
+ }