anastasiastasenko
committed on
Update README.md
README.md
CHANGED
@@ -72,6 +72,18 @@ For best results we recommend the following setting:
* Deterministic generation (temperature = 0) with no repetition penalty, since a repetition penalty is unsurprisingly detrimental to the accuracy of citations.
* Standardized hashes of 16 characters. While the model has been trained on many other patterns (including full bibliographic entries), this has proven the most convenient for systematic citation parsing.
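
As a rough illustration of the two recommendations above, the following Python sketch stores the sampling settings in a plain dictionary and extracts 16-character hashes from generated text. The names `GENERATION_SETTINGS`, `HASH_PATTERN`, and `extract_source_hashes` are hypothetical, and the hash alphabet (lowercase hexadecimal) is an assumption: the README only states that hashes are 16 characters long.

```python
import re

# Assumed settings mirroring the recommendations above. The key names follow
# common text-generation conventions and are not taken from the README.
GENERATION_SETTINGS = {
    "temperature": 0.0,         # deterministic generation
    "repetition_penalty": 1.0,  # 1.0 means no penalty is applied
}

# Assumption: a source hash is exactly 16 lowercase hexadecimal characters.
# The lookarounds reject longer hexadecimal runs.
HASH_PATTERN = re.compile(r"(?<![0-9a-f])[0-9a-f]{16}(?![0-9a-f])")


def extract_source_hashes(text: str) -> list[str]:
    """Return the distinct 16-character hashes in order of first appearance."""
    seen = {}
    for match in HASH_PATTERN.finditer(text):
        seen.setdefault(match.group(0), None)
    return list(seen)
```

Passing a generated answer through `extract_source_hashes` gives the cited hashes in order of first appearance. Adjust the pattern if your hashes use a different alphabet.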

+Training Greenhouse Gas Emissions: Estimated total location-based greenhouse gas emissions from training were 4 tons CO2eq.
+
+## Ethical Considerations
+
+The Pleias-Pico model, like all large language models, carries inherent ethical risks that require careful consideration. Our approach to mitigating these risks begins at the data level, where we exclusively use vetted sources and deliberately exclude CommonCrawl. The primary challenge comes from our public domain dataset component, which contains historical texts that may reflect outdated social norms and potentially harmful language, particularly regarding minoritized groups.
+
+To address this, we implemented a systematic ethical filtering process that uses toxicity classifiers to identify extremely harmful content. We also employed synthetic rewriting techniques to transform mildly problematic passages while preserving their informational value. This process significantly reduced potential societal harm without compromising the dataset's size or textual quality, and resulted in notably low toxicity scores on benchmarks compared to other models.
+
+Despite these preventive measures, users should be aware that the model has not undergone additional safety alignment procedures and may still produce problematic outputs. The model's capabilities in generative AI tasks must be balanced against the risks of bias, misinformation propagation, and the challenges of autonomous decision-making. We explicitly prohibit any malicious use and emphasize the responsibility of users to implement appropriate safeguards.
+
+At Pleias, we continue to research and develop improved methods for creating safer and more equitable models and datasets. This includes ongoing work in toxicity reduction, bias mitigation, and the development of more sophisticated ethical filtering techniques.
+
## Future updates
Pleias-Pico will be continuously improved through iterative retraining/adaptation.