CrabInHoney committed
Commit ece8f40 · verified · 1 Parent(s): 5c61ae3

Upload README.md

Files changed (1)
  1. README.md +116 -3
README.md CHANGED
@@ -1,3 +1,116 @@
- ---
- license: apache-2.0
- ---
+ ---
+ language:
+ - en
+ base_model:
+ - CrabInHoney/urlbert-tiny-base-v3
+ pipeline_tag: text-classification
+ tags:
+ - url
+ - cybersecurity
+ - urls
+ - links
+ - classification
+ - phishing-detection
+ - tiny
+ - phishing
+ - malware
+ - defacement
+ - transformers
+ - urlbert
+ - bert
+ - malicious
+ license: apache-2.0
+ ---
+
+ # URLBERT-Tiny-v3 Malicious URL Classifier
+
+ A lightweight BERT model fine-tuned to classify URLs into four categories: benign, phishing, malware, and defacement.
+
+ ## Model Details
+
+ - **Model size**: 3.69M parameters
+ - **Tensor type**: F32
+ - **Model weight size**: 14.8 MB
+ - **Base model**: [CrabInHoney/urlbert-tiny-base-v3](https://huggingface.co/CrabInHoney/urlbert-tiny-base-v3)
+ - **Dataset**: [Malicious URLs Dataset](https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset)
+
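The parameter count, tensor type, and weight size listed under Model Details are consistent with each other: an F32 value occupies 4 bytes, so 3.69M parameters come to roughly 14.8 MB. A minimal sketch of that arithmetic follows; the helper name `weight_size_mb` is hypothetical, and 1 MB is assumed to mean 10^6 bytes.

```python
# Rough size check: parameter count times bytes per parameter.
PARAMS = 3.69e6      # parameter count from the Model Details list
BYTES_PER_F32 = 4    # a 32-bit float occupies 4 bytes


def weight_size_mb(num_params: float, bytes_per_param: int = BYTES_PER_F32) -> float:
    """Approximate weight size in megabytes (assuming 1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


print(f"{weight_size_mb(PARAMS):.1f} MB")  # prints "14.8 MB"
```

The same helper gives a quick estimate for other precisions, for example `weight_size_mb(PARAMS, bytes_per_param=2)` for 16-bit weights.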
+ ## Model Evaluation Results
+
+ The model was evaluated on a test set with the following classification metrics:
+
+ | Class            | Precision | Recall   | F1-Score |
+ |------------------|-----------|----------|----------|
+ | Benign           | 0.987695  | 0.993717 | 0.990697 |
+ | Defacement       | 0.988510  | 0.998963 | 0.993709 |
+ | Malware          | 0.988291  | 0.960332 | 0.974111 |
+ | Phishing         | 0.958425  | 0.930826 | 0.944423 |
+ | **Accuracy**     | 0.983738  | 0.983738 | 0.983738 |
+ | **Macro Avg**    | 0.980730  | 0.970959 | 0.975735 |
+ | **Weighted Avg** | 0.983615  | 0.983738 | 0.983627 |
+
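The F1-Score column and the Macro Avg row can be recomputed from the per-class precision and recall values, which is a quick way to sanity-check the table. The sketch below assumes the standard definitions (F1 as the harmonic mean of precision and recall, macro average as the unweighted mean over classes); the names `REPORT`, `f1`, and `macro_average` are illustrative, not part of the model's code.

```python
# Per-class (precision, recall) pairs copied from the evaluation table.
REPORT = {
    "benign": (0.987695, 0.993717),
    "defacement": (0.988510, 0.998963),
    "malware": (0.988291, 0.960332),
    "phishing": (0.958425, 0.930826),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_average(values) -> float:
    """Unweighted mean over classes."""
    values = list(values)
    return sum(values) / len(values)


f1_scores = {name: f1(p, r) for name, (p, r) in REPORT.items()}
macro_precision = macro_average(p for p, _ in REPORT.values())
macro_recall = macro_average(r for _, r in REPORT.values())
macro_f1 = macro_average(f1_scores.values())

for name, score in f1_scores.items():
    print(f"{name:<12} F1 = {score:.6f}")
print(f"macro avg    P = {macro_precision:.6f}, R = {macro_recall:.6f}, F1 = {macro_f1:.6f}")
```

Because the inputs are already rounded to six decimals, the recomputed values can differ from the table in the last digit. The Weighted Avg row cannot be reproduced this way, since it needs the per-class sample counts, which are not given here.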
+ ## Usage Example
+
+ The following example classifies URLs with the Hugging Face `transformers` library:
+
+ ```python
+ from transformers import BertTokenizerFast, BertForSequenceClassification, pipeline
+ import torch
+
+ # Select the device (GPU or CPU)
+ device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+ print(f"Using device: {device}")
+
+ # Load the model and tokenizer
+ model_name = "CrabInHoney/urlbert-tiny-v3-malicious-url-classifier"
+ tokenizer = BertTokenizerFast.from_pretrained(model_name)
+ model = BertForSequenceClassification.from_pretrained(model_name)
+ model.to(device)
+
+ # Create the classification pipeline
+ # (return_all_scores=True returns a score for every class, not only the top one)
+ classifier = pipeline(
+     "text-classification",
+     model=model,
+     tokenizer=tokenizer,
+     device=0 if torch.cuda.is_available() else -1,
+     return_all_scores=True
+ )
+
+ # Example URLs to test
+ test_urls = [
+     "wikiobits.com/Obits/TonyProudfoot",
+     "http://www.824555.com/app/member/SportOption.php?uid=guest&langx=gb",
+ ]
+
+ # Map the raw labels to readable class names
+ label_mapping = {
+     "LABEL_0": "benign",
+     "LABEL_1": "defacement",
+     "LABEL_2": "malware",
+     "LABEL_3": "phishing"
+ }
+
+ # Classify each URL and print the score of every class
+ for url in test_urls:
+     results = classifier(url)
+     print(f"\nURL: {url}")
+     for result in results[0]:
+         label = result['label']
+         score = result['score']
+         friendly_label = label_mapping.get(label, label)
+         print(f"Class: {friendly_label}, probability: {score:.4f}")
+ ```
+
+ ### Example Output
+ ```
+ URL: wikiobits.com/Obits/TonyProudfoot
+ Class: benign, probability: 0.9953
+ Class: defacement, probability: 0.0000
+ Class: malware, probability: 0.0000
+ Class: phishing, probability: 0.0046
+
+ URL: http://www.824555.com/app/member/SportOption.php?uid=guest&langx=gb
+ Class: benign, probability: 0.0000
+ Class: defacement, probability: 0.0001
+ Class: malware, probability: 0.9998
+ Class: phishing, probability: 0.0001
+ ```
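The usage example prints a score for every class. In practice a caller often wants only the single most likely class. The sketch below shows one way to reduce a per-class score list to a top prediction; the helper `top_prediction` is hypothetical, and the sample input is hard-coded from the second URL in the example output rather than produced by running the model.

```python
# Label names as used in the README's usage example.
LABEL_MAPPING = {
    "LABEL_0": "benign",
    "LABEL_1": "defacement",
    "LABEL_2": "malware",
    "LABEL_3": "phishing",
}


def top_prediction(scores, mapping=LABEL_MAPPING):
    """Return (class name, score) for the highest-scoring entry.

    `scores` is a list of {"label": ..., "score": ...} dicts, the shape
    iterated over as `results[0]` in the usage example.
    """
    best = max(scores, key=lambda item: item["score"])
    return mapping.get(best["label"], best["label"]), best["score"]


# Scores copied from the second URL in the example output (not a live model call).
sample = [
    {"label": "LABEL_0", "score": 0.0000},
    {"label": "LABEL_1", "score": 0.0001},
    {"label": "LABEL_2", "score": 0.9998},
    {"label": "LABEL_3", "score": 0.0001},
]

label, score = top_prediction(sample)
print(f"{label} ({score:.4f})")  # prints "malware (0.9998)"
```

With the pipeline from the usage example, the equivalent call would be `top_prediction(classifier(url)[0])`.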