Youngja Park
committed on
Update README.md

README.md CHANGED

…
metrics:
- accuracy
- bertscore
pipeline_tag: text-classification
---
CTI-BERT is a pre-trained BERT model for the cybersecurity domain, especially for cyber-threat intelligence extraction and understanding.
For more details, please refer to [this paper](https://aclanthology.org/2023.emnlp-industry.12.pdf).

### Training

The model was trained on a security text corpus of about 1.2 billion words, drawn from security news articles, vulnerability descriptions, books, academic publications, Wikipedia pages, and other sources.
The model was pretrained using [the run_mlm script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py) with the MLM (masked language modeling) objective.
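
For orientation, the run_mlm script wraps a standard Trainer-based masked-language-modeling loop. The following is only a minimal sketch of that objective, not the actual CTI-BERT configuration: the stand-in tokenizer, the corpus file name `security_corpus.txt`, and the hyperparameters are placeholder assumptions.

```python
# Minimal MLM pretraining sketch with Hugging Face Transformers.
# The tokenizer, corpus path, and hyperparameters below are illustrative
# assumptions; they are not the settings used to train CTI-BERT.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # stand-in
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Load a plain-text corpus (hypothetical file name).
dataset = load_dataset("text", data_files={"train": "security_corpus.txt"})

def tokenize(batch):
    # Truncate to the card's stated sequence length of 256 tokens.
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# The collator randomly masks 15% of tokens, the standard MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cti-bert-mlm", per_device_train_batch_size=32),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```
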
#### Model description
It has a vocabulary of 50,000 tokens and a sequence length of 256.
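
As a quick way to verify those two numbers, a sketch along these lines can be used; the model ID `ibm-research/CTI-BERT` is an assumption and should be replaced with the actual repository name.

```python
# Inspect the tokenizer's vocabulary size and the model's position limit.
# The model ID below is an assumption; substitute the actual repository name.
from transformers import AutoConfig, AutoTokenizer

model_id = "ibm-research/CTI-BERT"  # hypothetical/assumed model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

print(tokenizer.vocab_size)            # expected: 50000
print(config.max_position_embeddings)  # expected: 256
```
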

The following hyperparameters were used during training:
…

#### Framework versions
- Pytorch 1.12.1+cu102
- Datasets 2.4.0
- Tokenizers 0.12.1

### Intended uses & limitations

You can use the raw model for either masked language modeling or token-embedding generation, but it is mostly intended to be fine-tuned on a downstream task such as sequence labeling (NER), text classification, or question answering.
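
A minimal sketch of both raw-model uses, with the same assumed model ID as above:

```python
# Raw-model usage: masked-token prediction and token-embedding extraction.
# The model ID is an assumption; replace it with the actual repository name.
import torch
from transformers import AutoModel, AutoTokenizer, pipeline

model_id = "ibm-research/CTI-BERT"  # hypothetical/assumed model ID

# 1) Masked language modeling with the fill-mask pipeline.
fill_mask = pipeline("fill-mask", model=model_id)
print(fill_mask("The attacker exploited a [MASK] in the web server."))

# 2) Token embeddings from the encoder's last hidden layer.
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id)
inputs = tokenizer("Phishing campaigns deliver malware.", return_tensors="pt")
with torch.no_grad():
    embeddings = encoder(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
print(embeddings.shape)
```
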
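
And a short fine-tuning sketch for one such downstream task, text classification; the model ID, label set, and two-example dataset are placeholders:

```python
# Fine-tuning sketch: text classification on top of the pretrained encoder.
# Model ID, labels, and the tiny dataset are placeholders/assumptions.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "ibm-research/CTI-BERT"  # hypothetical/assumed model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A fresh classification head is initialized on top of the encoder weights.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

train_ds = Dataset.from_dict({
    "text": ["Ransomware encrypts files on the victim host.",
             "The weather is nice today."],
    "label": [1, 0],  # placeholder labels: 1 = security-relevant, 0 = not
})
train_ds = train_ds.map(
    lambda b: tokenizer(b["text"], truncation=True, max_length=256, padding="max_length"),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cti-bert-clf", num_train_epochs=1),
    train_dataset=train_ds,
)
trainer.train()
```
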
The model has shown improved performance for various cybersecurity-domain tasks.
However, it is not intended to be used as the main model for general-domain documents.