nlp-thedeep/humbert

nlp-thedeep committed on Jan 16, 2023

Commit bc1988f · 1 Parent(s): ef5ed2c

Update README.md

Files changed (1)

README.md +33 -1

README.md CHANGED

@@ -4,4 +4,36 @@ language:
 - en
 - fr
 - es
----
+- multilingual
+---
+# HumBert
+HumBert is an [XLM-RoBERTa](https://huggingface.co/xlm-roberta-base) model trained on humanitarian texts: approximately 50 million textual examples (roughly 2 billion tokens) from public humanitarian reports, law cases and news articles.
+Data were collected from three main sources: [Reliefweb](https://reliefweb.int/), [UNHCR Refworld](https://www.refworld.org/) and [Europe Media Monitor News Brief](https://emm.newsbrief.eu/).
+Although XLM-RoBERTa was trained on 100 different languages, this fine-tuning was performed on only three (English, French and Spanish), because a sufficient amount of this kind of humanitarian data could not be found in other languages.
+## Intended uses & limitations
+To the best of our knowledge, HumBert is the first language model adapted to humanitarian topics, which often use very specific language; this adaptation makes transfer to downstream tasks (such as disaster response text classification) more effective.
+This model is primarily aimed at being fine-tuned on tasks such as sequence classification or token classification.
+## Benchmarks
+Soon...
+## Usage
+Here is how to use this model to get the features of a given text in PyTorch:
+```python
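+import torch
+from transformers import AutoTokenizer, AutoModelForMaskedLM
+# load the tokenizer and the masked-LM checkpoint from the Hub
+tokenizer = AutoTokenizer.from_pretrained("nlp-thedeep/humbert")
+model = AutoModelForMaskedLM.from_pretrained("nlp-thedeep/humbert")
+# prepare input
+text = "Replace me by any text you'd like."
+encoded_input = tokenizer(text, return_tensors="pt")
+# forward pass without gradient tracking; also request the hidden states
+with torch.no_grad():
+    output = model(**encoded_input, output_hidden_states=True)
+# per-token features from the last layer: (batch, seq_len, hidden_size)
+features = output.hidden_states[-1]
+# masked-LM scores: (batch, seq_len, vocab_size)
+logits = output.logits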
+```
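
The usage snippet above produces one vector per token, for example the hidden states a Transformers model returns when the forward pass is called with `output_hidden_states=True`. For downstream tasks a single vector per text is often more convenient. One common option, which this model card does not prescribe, is masked mean pooling: average the token vectors while skipping padding positions. The sketch below shows the idea in plain Python on toy data, so it runs without downloading the model. The names `masked_mean_pool`, `toy_vectors` and `toy_mask` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def masked_mean_pool(token_vectors: Sequence[Sequence[float]],
                     attention_mask: Sequence[int]) -> List[float]:
    """Average the vectors of non-padding tokens (mask == 1) into one vector."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("token_vectors and attention_mask must have the same length")
    # keep only the positions the attention mask marks as real tokens
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    # component-wise mean over the kept token vectors
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input (assumed data): three token positions with 2-dimensional vectors;
# the last position is padding and must not influence the result.
toy_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
sentence_vector = masked_mean_pool(toy_vectors, toy_mask)
print(sentence_vector)  # [2.0, 3.0]
```

With real model outputs the same computation would normally be done with tensor operations on the last hidden state and the tokenizer's `attention_mask`. Mean pooling is only one choice; whether it suits a given task is something to validate empirically.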