---
base_model: jjzha/jobbert-base-cased
model-index:
- name: jobbert-base-cased-compdecs
  results: []
license: mit
language:
- en
metrics:
- accuracy
pipeline_tag: text-classification
---

## 🖊️ Model description

This model is a fine-tuned version of [jjzha/jobbert-base-cased](https://huggingface.co/jjzha/jobbert-base-cased). JobBERT is a bert-base-cased checkpoint that has been continuously pre-trained on ~3.2M sentences from job postings.

It has been fine-tuned with a classification head that performs binary classification of job advert sentences: `company description` or not.

The model was trained on **486 labelled company description sentences** and **1,000 non-company-description sentences of less than 250 characters in length**.

It achieves the following results on a held-out test set of 147 sentences:

- Accuracy: 0.92157

| Label | Precision | Recall | F1-score | Support |
| ----------------------- | -------- | -------- | -------- | ------- |
| not company description | 0.930693 | 0.959184 | 0.944724 | 98 |
| company description | 0.913043 | 0.857143 | 0.884211 | 49 |
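
Per-class tables in this shape are what `sklearn.metrics.classification_report` produces, so a report like the one above can be regenerated from held-out predictions. A minimal sketch, where `y_true` and `y_pred` are illustrative placeholders rather than the actual evaluation data:

```python
# Minimal sketch: regenerate a per-class report from predictions.
# y_true / y_pred are placeholders, not the actual held-out data.
from sklearn.metrics import classification_report

y_true = [0, 1, 0, 1]  # gold labels: 0 = not company description, 1 = company description
y_pred = [0, 1, 0, 0]  # model predictions
print(classification_report(
    y_true,
    y_pred,
    target_names=["not company description", "company description"],
))
```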

## 🖨️ Use

To use the model:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model = AutoModelForSequenceClassification.from_pretrained("ihk/jobbert-base-cased-compdecs")
tokenizer = AutoTokenizer.from_pretrained("ihk/jobbert-base-cased-compdecs")

comp_classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)
```

An example use is as follows:

```python
job_sent = "Would you like to join a major manufacturing company?"
comp_classifier(job_sent)
# >> [{'label': 'LABEL_1', 'score': 0.9953641891479492}]
```
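
The pipeline returns the generic `LABEL_0`/`LABEL_1` names; judging from the example above, `LABEL_1` appears to be the `company description` class, though this is an inference rather than something stated in this card. One way to check is to inspect the mapping stored in the model config:

```python
# Inspect the id-to-label mapping saved with the model.
print(model.config.id2label)
# e.g. {0: 'LABEL_0', 1: 'LABEL_1'} if no custom label names were saved
```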

The intended use of this model is to extract company descriptions from online job adverts for use in downstream tasks such as mapping to [Standardised Industrial Classification (SIC)](https://www.gov.uk/government/publications/standard-industrial-classification-of-economic-activities-sic) codes.
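
As a sketch of that extraction step, continuing from the pipeline above: split an advert into sentences, classify each one, and keep those predicted as company descriptions. The regex sentence splitter and the `LABEL_1` mapping are assumptions for illustration, not part of this card:

```python
import re

advert = (
    "Would you like to join a major manufacturing company? "
    "We are looking for a forklift driver to join our warehouse team. "
    "Shifts run Monday to Friday."
)

# Naive sentence splitting on end punctuation; a real pipeline may use spaCy etc.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", advert) if s.strip()]

# Keep sentences the classifier flags as company descriptions (assumed LABEL_1).
company_desc = [
    sent
    for sent, pred in zip(sentences, comp_classifier(sentences))
    if pred["label"] == "LABEL_1"
]
print(company_desc)
```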

### ⚖️ Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- lr_scheduler_type: linear
- num_epochs: 10
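
For orientation, here is a minimal sketch of how these values map onto `transformers.TrainingArguments`; any argument not listed in this card (output directory, batch size, and so on) is a placeholder assumption:

```python
from transformers import TrainingArguments

# Sketch only: arguments not listed in this card are placeholder assumptions.
training_args = TrainingArguments(
    output_dir="jobbert-base-cased-compdecs",  # assumption
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    num_train_epochs=10,
    per_device_train_batch_size=16,  # assumption: not listed in this card
)
```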

### ⚖️ Training results

The fine-tuning metrics are as follows:
- eval_loss: 0.462236
- eval_runtime: 0.629300
- eval_samples_per_second: 233.582000
- eval_steps_per_second: 15.890000
- epoch: 10.000000
- perplexity: 1.590000
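
The reported perplexity is presumably just the exponential of the evaluation loss, since exp(0.462236) ≈ 1.59:

```python
import math

# Perplexity as exp of the evaluation loss.
print(math.exp(0.462236))  # ≈ 1.5876, matching the reported 1.59
```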

### ⚖️ Framework versions

- Transformers 4.32.0
- Pytorch 2.0.1+cu118
- Datasets 2.14.4
- Tokenizers 0.13.3