Youngja Park committed on
Commit 2472d44 · verified · 1 Parent(s): a2cd454

Update README.md

Files changed (1):
  1. README.md +14 -7
README.md CHANGED
@@ -5,19 +5,18 @@ language:
  metrics:
  - accuracy
  - bertscore
- base_model:
- - google-bert/bert-base-uncased
  pipeline_tag: text-classification
  ---
  CTI-BERT is a pre-trained BERT model for the cybersecurity domain, especially for cyber-threat intelligence extraction and understanding.

- The model was trained on a security text corpus which contains about 1.2 billion words.
- The corpus includes many security news, vulnerability descriptions, books, academic publications, Wikipedia pages, etc.

- The model has shown improved performance for various cybersecurity text classification tasks.
- However, it is not inteded to be used as the main model for general-domain documents.

- For more details, please refer to [this paper](https://aclanthology.org/2023.emnlp-industry.12.pdf).
+ For more details, please refer to [this paper](https://aclanthology.org/2023.emnlp-industry.12.pdf).

+ ### Training
+
+ The model was trained on a security text corpus that contains about 1.2 billion words drawn from
+ security news articles, vulnerability descriptions, books, academic publications, Wikipedia pages, etc.
+ The model was pretrained using [the run_mlm script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py) with the MLM (masked language modeling) objective.

  #### Model description
  It has a vocabulary of 50,000 tokens and a sequence length of 256.
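
As a quick illustration of the MLM objective named in the new Training section, here is a minimal sketch that queries the pretrained masked-language-model head through the `transformers` fill-mask pipeline. The checkpoint id `ibm-research/CTI-BERT` is an assumption for illustration, not something this diff confirms:

```python
from transformers import pipeline

# Assumed repo id -- substitute the model card's actual checkpoint.
fill_mask = pipeline("fill-mask", model="ibm-research/CTI-BERT")

# Look up the mask token from the tokenizer rather than hard-coding "[MASK]",
# since the custom 50,000-token vocabulary may define it differently.
mask = fill_mask.tokenizer.mask_token
for pred in fill_mask(f"The attacker gained access through a phishing {mask}."):
    print(f"{pred['token_str']:>15}  score={pred['score']:.3f}")
```
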
@@ -41,3 +40,11 @@ The following hyperparameters were used during training:
  - Pytorch 1.12.1+cu102
  - Datasets 2.4.0
  - Tokenizers 0.12.1
+
+ ### Intended uses & limitations
+
+ You can use the raw model for either masked language modeling or token-embedding generation, but it is mostly intended to be fine-tuned on a downstream task
+ such as token classification (e.g., NER), text classification, or question answering.
+
+ The model has shown improved performance for various cybersecurity-domain tasks.
+ However, it is not intended to be used as the main model for general-domain documents.
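
To make the intended uses added above concrete, here is a minimal sketch of both paths: raw token-embedding generation with the encoder, and attaching a fresh classification head as the starting point for downstream fine-tuning. The checkpoint id is again an assumption:

```python
import torch
from transformers import AutoModel, AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "ibm-research/CTI-BERT"  # assumption: substitute the real repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Use 1: token-embedding generation with the raw encoder.
encoder = AutoModel.from_pretrained(MODEL_ID)
inputs = tokenizer(
    "CVE-2021-44228 allows remote code execution via crafted JNDI lookups.",
    return_tensors="pt",
    truncation=True,
    max_length=256,  # matches the sequence length stated in the model description
)
with torch.no_grad():
    embeddings = encoder(**inputs).last_hidden_state  # (1, seq_len, hidden_size)

# Use 2: starting point for fine-tuning; a new classification head is
# initialized on top of the pretrained encoder and trained on labeled data.
classifier = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=4)
```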