Commit a2cd454 (verified) by Youngja Park · 1 parent: b12679f

Update README.md

Files changed (1): README.md (+43 -3)
@@ -1,3 +1,43 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ language:
+ - en
+ metrics:
+ - accuracy
+ - bertscore
+ base_model:
+ - google-bert/bert-base-uncased
+ pipeline_tag: text-classification
+ ---
+ CTI-BERT is a pre-trained BERT model for the cybersecurity domain, designed in particular for cyber-threat intelligence extraction and understanding.
+
+ The model was trained on a security text corpus of about 1.2 billion words.
+ The corpus includes security news articles, vulnerability descriptions, books, academic publications, Wikipedia pages, and similar sources.
+
+ The model has shown improved performance on various cybersecurity text classification tasks.
+ However, it is not intended to be used as the main model for general-domain documents.
+
+ For more details, please refer to [this paper](https://aclanthology.org/2023.emnlp-industry.12.pdf).
+
+ #### Model description
+ The model has a vocabulary of 50,000 tokens and a sequence length of 256.
+
+ The following hyperparameters were used during training:
+ - learning_rate: 0.0005
+ - train_batch_size: 128
+ - eval_batch_size: 128
+ - seed: 42
+ - gradient_accumulation_steps: 16
+ - total_train_batch_size: 2048
+ - optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-06
+ - lr_scheduler_type: linear
+ - lr_scheduler_warmup_steps: 10000
+ - training_steps: 200000
+
+
+ #### Framework versions
+
+ - Transformers 4.18.0
+ - PyTorch 1.12.1+cu102
+ - Datasets 2.4.0
+ - Tokenizers 0.12.1
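
The hyperparameters listed in the README can be sanity-checked with a short sketch. It assumes that `train_batch_size` is a per-device value, that training ran on a single device (the listed figures are consistent with this, since 128 × 16 = 2048), and that `lr_scheduler_type: linear` means linear warmup followed by linear decay to zero, as in the Transformers linear schedule. The function names `effective_batch_size` and `linear_warmup_decay_lr` are hypothetical helpers written for this illustration; they are not part of the model repository.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of examples contributing to one optimizer update.

    Assumption: num_devices=1, which is what the README's figures imply.
    """
    return per_device_batch * grad_accum_steps * num_devices


def linear_warmup_decay_lr(step, base_lr=5e-4, warmup_steps=10_000,
                           total_steps=200_000):
    """Learning rate at a given optimizer step.

    Assumption: linear warmup from 0 to base_lr over warmup_steps,
    then linear decay from base_lr to 0 at total_steps.
    Defaults are the values listed in the README.
    """
    if step < warmup_steps:
        # Warmup phase: scale the rate up proportionally to the step count.
        return base_lr * step / warmup_steps
    # Decay phase: scale by the fraction of post-warmup steps remaining.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    # train_batch_size=128, gradient_accumulation_steps=16
    print("effective batch size:", effective_batch_size(128, 16))
    for step in (0, 5_000, 10_000, 105_000, 200_000):
        print(f"step {step:>7}: lr = {linear_warmup_decay_lr(step):.6f}")
```

Under these assumptions the learning rate peaks at 0.0005 when warmup ends at step 10,000 and reaches zero at step 200,000. If training actually used several devices or a different per-device batch size, the arguments to `effective_batch_size` would need to change accordingly.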