Youngja Park
committed on
Update README.md

README.md CHANGED

…
metrics:
- accuracy
- bertscore
pipeline_tag: text-classification
---
CTI-BERT is a pre-trained BERT model for the cybersecurity domain, especially for cyber-threat intelligence extraction and understanding.
For more details, please refer to [this paper](https://aclanthology.org/2023.emnlp-industry.12.pdf).

### Training

The model was trained on a security text corpus of about 1.2 billion words, drawn from security news articles, vulnerability descriptions, books, academic publications, Wikipedia pages, and other sources.
The model was pretrained using [the run_mlm script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py) with the MLM (masked language modeling) objective.
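
For orientation, the run_mlm script wraps a standard Trainer-based masked-language-modeling loop. The following is only a minimal sketch of that objective, not the actual CTI-BERT configuration: the stand-in tokenizer, the corpus file name `security_corpus.txt`, and the hyperparameters are placeholder assumptions.

```python
# Minimal MLM pretraining sketch with Hugging Face Transformers.
# The tokenizer, corpus path, and hyperparameters below are illustrative
# assumptions; they are not the settings used to train CTI-BERT.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # stand-in
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Load a plain-text corpus (hypothetical file name).
dataset = load_dataset("text", data_files={"train": "security_corpus.txt"})

def tokenize(batch):
    # Truncate to the card's stated sequence length of 256 tokens.
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# The collator randomly masks 15% of tokens, the standard MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cti-bert-mlm", per_device_train_batch_size=32),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```
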
#### Model description
It has a vocabulary of 50,000 tokens and a sequence length of 256.
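
As a quick way to verify those two numbers, a sketch along these lines can be used; the model ID `ibm-research/CTI-BERT` is an assumption and should be replaced with the actual repository name.

```python
# Inspect the tokenizer's vocabulary size and the model's position limit.
# The model ID below is an assumption; substitute the actual repository name.
from transformers import AutoConfig, AutoTokenizer

model_id = "ibm-research/CTI-BERT"  # hypothetical/assumed model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

print(tokenizer.vocab_size)            # expected: 50000
print(config.max_position_embeddings)  # expected: 256
```
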

The following hyperparameters were used during training:
…

#### Framework versions
- Pytorch 1.12.1+cu102
- Datasets 2.4.0
- Tokenizers 0.12.1

### Intended uses & limitations

You can use the raw model for either masked language modeling or token-embedding generation, but it is mostly intended to be fine-tuned on a downstream task such as sequence labeling (NER), text classification, or question answering.
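
A minimal sketch of both raw-model uses, with the same assumed model ID as above:

```python
# Raw-model usage: masked-token prediction and token-embedding extraction.
# The model ID is an assumption; replace it with the actual repository name.
import torch
from transformers import AutoModel, AutoTokenizer, pipeline

model_id = "ibm-research/CTI-BERT"  # hypothetical/assumed model ID

# 1) Masked language modeling with the fill-mask pipeline.
fill_mask = pipeline("fill-mask", model=model_id)
print(fill_mask("The attacker exploited a [MASK] in the web server."))

# 2) Token embeddings from the encoder's last hidden layer.
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id)
inputs = tokenizer("Phishing campaigns deliver malware.", return_tensors="pt")
with torch.no_grad():
    embeddings = encoder(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
print(embeddings.shape)
```
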
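
And a short fine-tuning sketch for one such downstream task, text classification; the model ID, label set, and two-example dataset are placeholders:

```python
# Fine-tuning sketch: text classification on top of the pretrained encoder.
# Model ID, labels, and the tiny dataset are placeholders/assumptions.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "ibm-research/CTI-BERT"  # hypothetical/assumed model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A fresh classification head is initialized on top of the encoder weights.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

train_ds = Dataset.from_dict({
    "text": ["Ransomware encrypts files on the victim host.",
             "The weather is nice today."],
    "label": [1, 0],  # placeholder labels: 1 = security-relevant, 0 = not
})
train_ds = train_ds.map(
    lambda b: tokenizer(b["text"], truncation=True, max_length=256, padding="max_length"),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cti-bert-clf", num_train_epochs=1),
    train_dataset=train_ds,
)
trainer.train()
```
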
The model has shown improved performance for various cybersecurity-domain tasks.
However, it is not intended to be used as the main model for general-domain documents.