Cyrile committed on
Commit
92ed3a6
·
1 Parent(s): 0c53ad0

Update README.md

Files changed (1)
  1. README.md +32 -0
README.md CHANGED
@@ -1,3 +1,35 @@
  ---
  license: bigscience-bloom-rail-1.0
+ datasets:
+ - ehartford/wizard_vicuna_70k_unfiltered
+ - shahules786/orca-chat
+ - timdettmers/openassistant-guanaco
+ - laion/OIG
+ language:
+ - fr
+ - en
+ library_name: transformers
+ pipeline_tag: text-generation
  ---
+
+ Bloomz-560m-sft-chat
+ --------------------
+
+ We introduce Bloomz-560m-sft-chat, a fine-tuned version of the Large Language Model (LLM) [bigscience/bloomz-560m](https://huggingface.co/bigscience/bloomz-560m). The model has been fine-tuned for chatbot use, and its weights have been converted from float16 to bfloat16. It is therefore a solid starting point for further fine-tuning on more specific tasks.
+
+ The model was trained on equal amounts of French and English data, so it is most effective in these two languages (and when the two are mixed). Because of the conversion from float16 to bfloat16, we do not guarantee that the original model's multilingual capabilities are preserved. However, fine-tuning can restore reasonable performance in other languages.
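
As an illustration of why a float16 to bfloat16 conversion is not lossless in general, the sketch below emulates both formats in pure Python. bfloat16 keeps 7 explicit mantissa bits against 10 for float16, so some float16 values move when re-rounded. The helper names `to_float16` and `to_bfloat16` are hypothetical, and this is not the conversion procedure used for this model.

```python
import struct


def to_float16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (float16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bfloat16(x: float) -> float:
    """Round a finite x to bfloat16: the top 16 bits of its float32 encoding.

    Uses round-to-nearest-even; NaN and infinity are not handled here.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties go to even
    bits &= 0xFFFF0000                   # drop the 16 low-order bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Start from float16 values (as stored in the original checkpoint format)
# and re-round them to bfloat16 to see which ones change.
for value in (0.5, 0.1, 3.14159):
    half = to_float16(value)
    brain = to_bfloat16(half)
    print(f"{value}: float16={half!r} bfloat16={brain!r} changed={half != brain}")
```

Values that need more than 7 mantissa bits shift slightly, which is consistent with the caution above about multilingual capabilities; how much this matters in practice for a given model is an empirical question.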
+
+ The objective is to train all three models (Bloomz-{560m, 3b, 7b1-mt}-sft-chat) so that they provide high-performing, energy-efficient, and fast "foundation" models for inference on "realistic" infrastructure, of the kind available to a business with standard industrial capabilities.
+
+
+ Bloomz, through its license, enables free and flexible industrial use. Its tokenizer was designed with a truly multilingual context in mind and produces significantly fewer tokens per word than the tokenizers of other LLMs. This not only improves performance but also makes inference more efficient: the same text is encoded in a shorter context, so generating it takes fewer model calls. The table below illustrates this for French, using the longest sentence by Marcel Proust (823 words):
+ ```
+ Sans honneur que précaire, sans liberté que provisoire, [...], et de façon qu’à eux-mêmes il ne leur paraisse pas un vice.
+ ```
+
+ | model | GPT 3.5 | Boris | Flan-T5 | LLaMA | Dolly | MPT | Falcon | Bloomz |
+ |:--------------:|:-------:|:-----:|:-------:|:-----:|:-----:|:---:|:------:|:------:|
+ | tokens per word | 2.3 | 2.3 | 2 | 1.9 | 1.9 | 1.9 | 1.8 | 1.4 |
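
A ratio like the ones in the table can be measured with a few lines of code. The following is a minimal sketch, assuming words are counted by splitting on whitespace (the README does not say how the 823 words were counted). `tokens_per_word` and `toy_tokenize` are hypothetical names, and the toy tokenizer is only a stand-in so the example runs without downloading anything.

```python
from typing import Callable, List


def tokens_per_word(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def toy_tokenize(text: str) -> List[str]:
    """Toy tokenizer for illustration only: chunks of at most 4 characters.

    It is NOT one of the tokenizers compared in the table.
    """
    return [word[i:i + 4] for word in text.split() for i in range(0, len(word), 4)]


sample = "Sans honneur que précaire, sans liberté que provisoire"
ratio = tokens_per_word(sample, toy_tokenize)
print(f"{ratio:.2f} tokens per word")
```

With `transformers` installed, the `tokenize` method of `AutoTokenizer.from_pretrained("bigscience/bloomz-560m")` could be passed in place of `toy_tokenize` to measure the ratio for a real tokenizer.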
+
+
+ For comparison, a specialized French tokenizer such as [CamemBERT](https://huggingface.co/camembert/camembert-base) or [DistilCamemBERT](https://huggingface.co/cmarkea/distilcamembert-base) yields 1.5 tokens per word. Beyond its positive impact on inference time and resource consumption, a direct relationship has already been demonstrated between the number of tokens per word required for modeling and the predictive performance of the model [1].
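
To make the inference-cost argument concrete, the ratios from the table can be turned into approximate token counts for the 823-word sentence. This is a rough sketch: the ratios are rounded to one decimal, so the counts are estimates, and the comparison assumes one forward pass per generated token, as in standard autoregressive decoding. The names `TOKENS_PER_WORD` and `estimated_tokens` are hypothetical.

```python
SENTENCE_WORDS = 823  # word count of the Proust sentence quoted in the text

# Ratios copied from the table (rounded values, so results are approximate).
TOKENS_PER_WORD = {
    "GPT 3.5": 2.3,
    "Boris": 2.3,
    "Flan-T5": 2.0,
    "LLaMA": 1.9,
    "Dolly": 1.9,
    "MPT": 1.9,
    "Falcon": 1.8,
    "Bloomz": 1.4,
}


def estimated_tokens(ratio: float, words: int = SENTENCE_WORDS) -> int:
    """Approximate token count for a text of `words` words."""
    return round(ratio * words)


bloomz_tokens = estimated_tokens(TOKENS_PER_WORD["Bloomz"])
for model, ratio in TOKENS_PER_WORD.items():
    n = estimated_tokens(ratio)
    print(f"{model:8s} ~{n:5d} tokens ({n / bloomz_tokens:.2f}x Bloomz)")
```

Under these assumptions, a tokenizer at 2.3 tokens per word needs roughly 1.6 times as many decoding steps as Bloomz to produce the same sentence.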