peralp24 committed
Commit
f795f86
1 Parent(s): 1b1d4ba

Update README.md

Files changed (1)
  1. README.md +1 -23
README.md CHANGED
@@ -338,26 +338,4 @@ for downstream use and multilinguality.
 
  ### Tokenization
 
- Our tokenizer has a vocabulary size of 128,000 and was trained with the Unigram algorithm, using the implementation provided by the SentencePiece library.
- The tokenizer training set was a small subset of our high-quality data. After the training procedure, we performed some additional cleaning steps:
- * Split whole-number tokens (e.g. 12345) into individual digit tokens
- * Remove double spaces: remove any token that contains a double space
- * Remove tokens that contain a zero-width space (except the zero-width space token itself)
- * Remove tokens that contain a substring of more than 3 identical consecutive characters (e.g. bananaaaa, caaaar)
- * Remove any token that contains "\n" and is not exactly "\n" or "\r"
-
- ### Tokenizer fertility
-
- Tokenizer fertility is a metric used to evaluate tokenizer performance. It measures a tokenizer’s ability to
- represent text and is calculated by dividing the number of tokens in a text (after tokenizing) by the number of words in that
- same text [(https://arxiv.org/pdf/2310.08754)](https://arxiv.org/pdf/2310.08754). The tokenizer fertility of the Pharia-1-Embedding-4608-control model is lower
- than that of Mistral-7B-Instruct-v0.3 and llama-3.1-8b-instruct for 4 out of the 7 supported European languages.
- The Pharia-1-LLM-7B model’s tokenizer can thus represent the same text more efficiently, i.e. with fewer tokens, and is
- therefore more cost-efficient at inference time.
-
- |Tokenizer fertility|Pharia-1-LLM-7B-control, Pharia-1-LLM-7B-control-aligned|Mistral-7B-Instruct-v0.3|llama-3.1-8b-instruct|
- |--|--|--|--|
- |de|2.011|2.546|2.241|
- |fr|1.896|2.105|1.836|
- |es|1.673|2.030|1.749|
- |en|1.633|1.681|1.410|
+ Tokenization in this embedding model takes full advantage of the tokenizer of the [Pharia-1-LLM-7B-control model](https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control).
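
For reference, here is a minimal Python sketch of the vocabulary-cleaning rules listed in the removed section above. The function names, the regular expression, and the digit-splitting helper are illustrative assumptions, not the actual cleaning pipeline used for the Pharia tokenizer.

```python
import re

ZERO_WIDTH_SPACE = "\u200b"

def keep_token(token: str) -> bool:
    """Hypothetical filter reproducing the cleaning rules described above."""
    # Drop tokens containing a double space.
    if "  " in token:
        return False
    # Drop tokens containing a zero-width space, except the zero-width space token itself.
    if ZERO_WIDTH_SPACE in token and token != ZERO_WIDTH_SPACE:
        return False
    # Drop tokens with more than 3 identical consecutive characters (e.g. "bananaaaa").
    if re.search(r"(.)\1{3,}", token):
        return False
    # Drop tokens containing "\n" unless the token is exactly "\n" or "\r".
    if "\n" in token and token not in ("\n", "\r"):
        return False
    return True

def split_number_token(token: str) -> list[str]:
    """Split whole-number tokens (e.g. "12345") into individual digit tokens."""
    return list(token) if token.isdigit() and len(token) > 1 else [token]
```

A candidate vocabulary would then be filtered with something like `[t for t in vocab if keep_token(t)]`.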
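Similarly, a minimal sketch of the fertility computation described above: the number of tokens produced for a text divided by the number of words in that text. The whitespace word-splitting convention and the placeholder tokenizer ID are assumptions; the referenced paper may count words differently, and the table above was not produced with this exact script.

```python
from transformers import AutoTokenizer

def tokenizer_fertility(tokenizer, texts: list[str]) -> float:
    """Fertility = total tokens produced / total whitespace-separated words."""
    total_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

# "gpt2" is only a freely downloadable placeholder; substitute the tokenizers you want to compare.
tok = AutoTokenizer.from_pretrained("gpt2")
sample_de = ["Die Katze sitzt auf der Matte und beobachtet die Vögel im Garten."]
print(f"fertility (de sample): {tokenizer_fertility(tok, sample_de):.3f}")
```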