chgrdj committed · Commit ea60b7e (verified) · 1 parent: c998c4e

Upload 9 files

URLGuardian/README.md ADDED
@@ -0,0 +1,114 @@
1
+ ---
2
+ license: mit
3
+ language:
4
+ - en
5
+ base_model:
6
+ - distilbert/distilbert-base-cased
7
+ pipeline_tag: text-classification
8
+ library_name: transformers
9
+ tags:
10
+ - code
11
+ - cyber
12
+ ---
13
+
14
+
15
+ # URLGuardian
16
+
17
+ This is a Transformers model fine-tuned for malicious URL detection. Given a URL with a fully qualified domain name (FQDN), it outputs the probability that the URL is malicious, based on common suspicious patterns.
18
+
19
+ ## Model Details
20
+
21
+ ### Model Description
22
+
23
+ - **Developed by:** Anvilogic
24
+ - **Model Type:** Transformer
25
+ - **Maximum Sequence Length:** 512 tokens
26
+ - **Output Dimensionality:** 768 (hidden size)
27
+ - **Finetuned from model:** [distilbert](https://huggingface.co/distilbert/distilbert-base-cased)
28
+ - **Language(s) (NLP):** English
29
+ - **License:** MIT
30
+
31
+ ### Full Model Architecture
32
+
33
+ ```
34
+ DistilBERT:
35
+ name: "distilbert-base-cased"
36
+ params:
37
+ layers: 6
38
+ hidden_size: 768
39
+ attention_heads: 12
40
+ ff_dim: 3072
41
+ max_seq_len: 512
42
+ vocab_size: 28996
43
+ total_params: 66M
44
+ activation: "gelu"
45
+ ```
46
+
47
+ ## Usage
48
+
49
+ ### Direct Usage
50
+
51
+ First install the Transformers library:
52
+
53
+ ```bash
54
+ pip install -U transformers torch
55
+ ```
56
+ Then you can load this model and run inference.
57
+ ```python
58
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
59
+ import torch
60
+ # Load pre-trained model and tokenizer
61
+ model_name = "Anvilogic/URLGuardian"
62
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
63
+ model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)  # binary classification head
64
+ # Example URLs
65
+ urls = ["paypal.com.secure-login.xyz", "bit.ly/fake-login"]
66
+ # Tokenize inputs
67
+ inputs = tokenizer(urls, padding=True, truncation=True, return_tensors="pt")
68
+ # Run inference
69
+ with torch.no_grad():
70
+     outputs = model(**inputs)
71
+ logits = outputs.logits # Raw predictions
72
+ predictions = torch.argmax(logits, dim=-1) # Convert to class labels
73
+ # Print results
74
+ print(predictions.tolist())  # Example output: [1, 0] (assuming 1 = malicious, 0 = benign)
75
+ ```
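The snippet above returns hard class labels, while the model description speaks of a probability. A minimal sketch of turning one row of logits into probabilities with a softmax follows. It uses plain Python so it runs without downloading the model; the example logits are made up, and treating index 1 as the malicious class is an assumption, since the uploaded config.json does not define `id2label`. With PyTorch tensors the equivalent call is `torch.softmax(logits, dim=-1)`.

```python
import math

def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def malicious_probability(logits, malicious_index=1):
    """Probability of the class assumed to mean 'malicious' (index 1 is an assumption)."""
    return softmax(logits)[malicious_index]

# Hypothetical logits for two URLs, in the order [class 0, class 1]
example_logits = [[-2.0, 3.0], [1.5, -0.5]]
for row in example_logits:
    print(round(malicious_probability(row), 4))
```

Working with a probability rather than `argmax` lets a deployment pick its own decision threshold, for example a stricter one when false positives are costly.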
76
+ ### Downstream Usage
77
+ This model is lightweight enough for real-time malicious URL detection and supports large-scale inference for phishing prevention and cybersecurity monitoring.
78
+ ## Training Details
79
+
80
+ ### Framework Versions
81
+ - Python: 3.10.14
82
+ - Transformers: 4.46.0
83
+ - PyTorch: 2.2.2
84
+ - Tokenizers: 0.20.3
85
+
86
+ ### Training Data
87
+
88
+ The model was fine-tuned on [Anvilogic/URL-Guardian-Dataset](https://huggingface.co/datasets/Anvilogic/URL-Guardian-Dataset), which contains URLs and their labels.
89
+ The dataset was filtered and converted to the parquet format for efficient processing.
90
+
91
+ ### Training Procedure
92
+ The model was optimized using [BCELoss](https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html).
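How BCELoss was attached to the model is not stated here, and the uploaded config.json declares `single_label_classification` with a two-logit head, for which Transformers computes cross-entropy by default. The two views agree numerically: cross-entropy over a two-way softmax equals binary cross-entropy on the probability of class 1. A small pure-Python check, using made-up logits:

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one example from raw logits: logsumexp(logits) - logits[label]."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label]

def binary_cross_entropy(p, label):
    """BCE for one example, where p is the predicted probability of class 1."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

logits = [-0.4, 1.1]  # hypothetical [class 0, class 1] logits
p_class1 = 1.0 / (1.0 + math.exp(logits[0] - logits[1]))  # equals softmax(logits)[1]

for label in (0, 1):
    print(label, cross_entropy(logits, label), binary_cross_entropy(p_class1, label))
```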
93
+
94
+ #### Training Hyperparameters
95
+ - **Model Architecture**: encoder fine-tuned from [distilbert](https://huggingface.co/distilbert/distilbert-base-cased)
96
+ - **Batch Size**: 32
97
+ - **Epochs**: 3
98
+ - **Learning Rate**: 2e-5
99
+ - **Warmup Steps**: 100
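The list gives a peak learning rate and a warmup length but not the decay shape. Assuming the linear warmup followed by linear decay that the Transformers `Trainer` uses by default, the learning rate at a given optimizer step can be sketched as below. `TOTAL_STEPS` depends on the dataset size, which is not stated, so the value 1000 is purely illustrative.

```python
PEAK_LR = 2e-5       # from the hyperparameter list
WARMUP_STEPS = 100   # from the hyperparameter list

def lr_at(step, total_steps, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS):
    """Linear warmup to peak_lr, then linear decay to 0 (assumed schedule)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

TOTAL_STEPS = 1000  # hypothetical; the real value depends on dataset size
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, TOTAL_STEPS))
```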
100
+
101
+
102
+ ## Evaluation
103
+
104
+ In the final evaluation after training, the model achieved the following metrics on the test set:
105
+
106
+ **Binary Classification Evaluator**
107
+ ```
108
+ Accuracy : 0.9744
109
+ F1 Score : 0.9742
110
+ Precision : 0.9771
111
+ Recall : 0.9712
112
+ Average Precision : 0.9962
113
+ ```
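As a consistency check, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. The sketch below does that; a difference in the last digit is expected because the inputs are already rounded to four decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values copied from the evaluation block above
reported = {"precision": 0.9771, "recall": 0.9712, "f1": 0.9742}
recomputed = f1_score(reported["precision"], reported["recall"])
print(f"recomputed F1 = {recomputed:.4f}, reported F1 = {reported['f1']:.4f}")
```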
114
+ These results indicate that the model identifies malicious URLs accurately, with precision and recall both above 0.97 on the test set, which makes it well suited to cybersecurity applications.
URLGuardian/config.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "_name_or_path": "distilbert-base-cased",
3
+ "activation": "gelu",
4
+ "architectures": [
5
+ "DistilBertForSequenceClassification"
6
+ ],
7
+ "attention_dropout": 0.1,
8
+ "dim": 768,
9
+ "dropout": 0.1,
10
+ "hidden_dim": 3072,
11
+ "initializer_range": 0.02,
12
+ "max_position_embeddings": 512,
13
+ "model_type": "distilbert",
14
+ "n_heads": 12,
15
+ "n_layers": 6,
16
+ "output_past": true,
17
+ "pad_token_id": 0,
18
+ "problem_type": "single_label_classification",
19
+ "qa_dropout": 0.1,
20
+ "seq_classif_dropout": 0.2,
21
+ "sinusoidal_pos_embds": false,
22
+ "tie_weights_": true,
23
+ "torch_dtype": "float32",
24
+ "transformers_version": "4.46.0",
25
+ "vocab_size": 28996
26
+ }
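The fields above are enough to count the model's parameters by hand. The sketch below assumes the standard Transformers layout for `DistilBertForSequenceClassification` (word and position embeddings with a LayerNorm, six transformer blocks, then a `pre_classifier` layer and a `classifier` layer). The two output labels are an assumption taken from the README usage snippet, since `num_labels` is not stored in this config.

```python
# Values copied from config.json
DIM, HIDDEN_DIM, N_LAYERS = 768, 3072, 6
VOCAB_SIZE, MAX_POSITIONS = 28996, 512
NUM_LABELS = 2  # assumption: binary head, not stored in the config

def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out

def layer_norm(n):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n

embeddings = VOCAB_SIZE * DIM + MAX_POSITIONS * DIM + layer_norm(DIM)
attention = 4 * linear(DIM, DIM)  # query, key, value, output projections
feed_forward = linear(DIM, HIDDEN_DIM) + linear(HIDDEN_DIM, DIM)
block = attention + feed_forward + 2 * layer_norm(DIM)
encoder = embeddings + N_LAYERS * block
head = linear(DIM, DIM) + linear(DIM, NUM_LABELS)  # pre_classifier + classifier
total = encoder + head

print(f"encoder: {encoder:,}  head: {head:,}  total: {total:,}")
print(f"float32 bytes: {4 * total:,}")
```

Under these assumptions the count is 65,783,042 parameters, or 263,132,168 bytes in float32. That is 12,512 bytes less than the 263,144,680-byte `model.safetensors` in this commit; the remainder is plausibly the file's header.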
URLGuardian/gitattributes ADDED
@@ -0,0 +1,35 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
URLGuardian/model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0ecf4221da48f739893107179765b83e379e77f52b96f27c1dde1419404943cf
3
+ size 263144680
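The three lines above are a Git LFS pointer, not the weights themselves: each line is a key, a space, and a value. A minimal sketch of reading such a pointer in Python follows; `parse_lfs_pointer` is a hypothetical helper, not part of any library.

```python
def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }

pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:0ecf4221da48f739893107179765b83e379e77f52b96f27c1dde1419404943cf
size 263144680
"""
info = parse_lfs_pointer(pointer)
print(info["hash_algorithm"], info["size_bytes"])
```

After downloading the real file, its byte length and SHA-256 digest can be compared against `size_bytes` and `digest` to confirm the download is intact.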
URLGuardian/special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "cls_token": "[CLS]",
3
+ "mask_token": "[MASK]",
4
+ "pad_token": "[PAD]",
5
+ "sep_token": "[SEP]",
6
+ "unk_token": "[UNK]"
7
+ }
URLGuardian/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
URLGuardian/tokenizer_config.json ADDED
@@ -0,0 +1,55 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[PAD]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "100": {
12
+ "content": "[UNK]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "101": {
20
+ "content": "[CLS]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "102": {
28
+ "content": "[SEP]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "103": {
36
+ "content": "[MASK]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "clean_up_tokenization_spaces": false,
45
+ "cls_token": "[CLS]",
46
+ "do_lower_case": false,
47
+ "mask_token": "[MASK]",
48
+ "model_max_length": 512,
49
+ "pad_token": "[PAD]",
50
+ "sep_token": "[SEP]",
51
+ "strip_accents": null,
52
+ "tokenize_chinese_chars": true,
53
+ "tokenizer_class": "DistilBertTokenizer",
54
+ "unk_token": "[UNK]"
55
+ }
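The special-token IDs above determine how a tokenized URL is framed before it reaches the model. The sketch below mimics, in plain Python, what `padding=True, truncation=True` does in the README snippet: wrap each sequence in `[CLS]` and `[SEP]`, truncate to `model_max_length`, and pad to the longest sequence in the batch. The word-piece IDs in the example are invented, and `encode_batch` is a hypothetical helper rather than a tokenizer API.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # from added_tokens_decoder
MODEL_MAX_LENGTH = 512                # from model_max_length

def encode_batch(batch_piece_ids, max_length=MODEL_MAX_LENGTH):
    """Add [CLS]/[SEP], truncate, then pad every row to the longest row."""
    rows = []
    for piece_ids in batch_piece_ids:
        body = piece_ids[: max_length - 2]  # leave room for [CLS] and [SEP]
        rows.append([CLS_ID] + body + [SEP_ID])
    width = max(len(row) for row in rows)
    input_ids = [row + [PAD_ID] * (width - len(row)) for row in rows]
    attention_mask = [[1] * len(row) + [0] * (width - len(row)) for row in rows]
    return input_ids, attention_mask

# Invented word-piece IDs standing in for two tokenized URLs
input_ids, attention_mask = encode_batch([[7001, 7002, 7003], [8001]])
print(input_ids)
print(attention_mask)
```

The attention mask marks real tokens with 1 and padding with 0, so padded positions do not influence the prediction.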
URLGuardian/training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:861efa3a0c559688eebcfd4bfad6184cc0671abca0afaee84a39bdb6e69b3c8d
3
+ size 5240
URLGuardian/vocab.txt ADDED
The diff for this file is too large to render. See raw diff