alex-miller committed
Commit 6629ddc · verified · 1 Parent(s): 37e2a38

Update README.md

Files changed (1):
  1. README.md +9 -7
README.md CHANGED
@@ -4,30 +4,32 @@ base_model: bert-base-multilingual-uncased
 tags:
 - generated_from_trainer
 model-index:
-- name: bert-base-multilingual-uncased-finetuned-wiki-crs
+- name: ODABert
   results: []
+datasets:
+- alex-miller/oecd-dac-crs
 ---
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment. -->
 
-# bert-base-multilingual-uncased-finetuned-wiki-crs
+# ODABert
 
-This model is a fine-tuned version of [bert-base-multilingual-uncased](https://huggingface.co/bert-base-multilingual-uncased) on the None dataset.
+This model is a fine-tuned version of [bert-base-multilingual-uncased](https://huggingface.co/bert-base-multilingual-uncased) on the [OECD DAC CRS project titles and descriptions](https://huggingface.co/datasets/alex-miller/oecd-dac-crs) dataset.
 It achieves the following results on the evaluation set:
 - Loss: 0.9961
 
 ## Model description
 
-More information needed
+A 3-epoch fine-tune of BERT base multilingual uncased on development and humanitarian finance project titles and descriptions from the OECD DAC CRS. The vocabulary of the base model was expanded by 1,059 tokens (a 1% increase) based on the most prevalent tokens in the CRS that were not present in the base model vocabulary.
 
 ## Intended uses & limitations
 
-More information needed
+Developed as an experiment to see whether fine-tuning on the CRS would improve classifier models built on top of this MLM. Although it is built on a multilingual model, and the fine-tuning texts include other languages, English is the most prevalent.
 
 ## Training and evaluation data
 
-More information needed
+See the [OECD DAC CRS project titles and descriptions](https://huggingface.co/datasets/alex-miller/oecd-dac-crs) dataset.
 
 ## Training procedure
 
@@ -56,4 +58,4 @@ The following hyperparameters were used during training:
 - Transformers 4.38.2
 - Pytorch 2.0.1
 - Datasets 2.18.0
-- Tokenizers 0.15.2
+- Tokenizers 0.15.2
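The vocabulary-expansion step described in the updated card (adding the most prevalent CRS tokens missing from the base vocabulary) could be sketched roughly as below. This is an illustrative assumption, not the author's actual code: the function name, the toy base vocabulary, and the toy corpus are all hypothetical, and only the selection logic (count out-of-vocabulary tokens, keep the top N) is shown.

```python
from collections import Counter

def expand_vocab(base_vocab, corpus_texts, max_new_tokens):
    """Return the base vocabulary extended with the most frequent
    corpus tokens that the base vocabulary is missing.

    Simplified sketch: real BERT tokenization uses WordPiece, not
    whitespace splitting, and the actual token budget here would be
    the 1,059 tokens mentioned in the card.
    """
    counts = Counter()
    for text in corpus_texts:
        for token in text.lower().split():
            if token not in base_vocab:
                counts[token] += 1
    # Keep only the top-N most prevalent out-of-vocabulary tokens.
    new_tokens = [tok for tok, _ in counts.most_common(max_new_tokens)]
    return list(base_vocab) + new_tokens

# Toy example: "sanitation" (3 occurrences) and "hygiene" (1) are the
# most prevalent tokens absent from the base vocabulary.
base = {"water", "project", "support"}
corpus = [
    "Sanitation sanitation water project",
    "sanitation hygiene support",
]
vocab = expand_vocab(base, corpus, max_new_tokens=2)
```

In a real Transformers pipeline, the selected tokens would presumably be registered with `tokenizer.add_tokens(new_tokens)` followed by `model.resize_token_embeddings(len(tokenizer))` so the embedding matrix grows to match the expanded vocabulary before fine-tuning.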