---
datasets:
- ealvaradob/phishing-dataset
language:
- en
base_model:
- CrabInHoney/urlbert-tiny-base-v3
pipeline_tag: text-classification
tags:
- url
- urls
- links
- classification
- tiny
- phishing
- urlbert
license: apache-2.0
---
This is a very small BERT model that classifies URLs as phishing or non-phishing.

It is an updated, lighter version of the earlier URL classification model.

Old version: https://huggingface.co/CrabInHoney/urlbert-tiny-v2-phishing-classifier

##### Comparison with the previous version of the urlbert phishing classifier:

| Version | Accuracy | Precision | Recall | F1-score |
| ------- | -------- | --------- | ------ | -------- |
| v2 | 0.9665 | 0.9756 | 0.9522 | 0.9637 |
| **v3** | **0.9819** | **0.9876** | **0.9734** | **0.9805** |

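As a quick sanity check on the table, F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. The sketch below only uses the numbers reported above; the helper name `f1_score` is illustrative and not part of the model or its training code.

```python
# Precision, recall and F1 copied from the comparison table above.
reported = {
    "v2": {"precision": 0.9756, "recall": 0.9522, "f1": 0.9637},
    "v3": {"precision": 0.9876, "recall": 0.9734, "f1": 0.9805},
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for version, metrics in reported.items():
    recomputed = f1_score(metrics["precision"], metrics["recall"])
    # The table values are rounded to four decimals, so expect tiny differences.
    print(f"{version}: reported F1 = {metrics['f1']:.4f}, recomputed F1 = {recomputed:.4f}")
```

Because the inputs are already rounded, the recomputed values can differ from the reported ones in the fourth decimal place.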

Model size: 3.69M params

Tensor type: F32

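For a rough idea of what these two figures mean in practice, the weight footprint can be estimated as parameter count times bytes per parameter. This is back-of-the-envelope arithmetic only: it assumes the rounded 3.69M figure and ignores the tokenizer files, activations, and framework overhead.

```python
PARAM_COUNT = 3_690_000  # "3.69M params" from the model card (rounded)
BYTES_PER_PARAM = 4      # F32 is a 32-bit float, i.e. 4 bytes

weight_bytes = PARAM_COUNT * BYTES_PER_PARAM
weight_mb = weight_bytes / 1e6  # decimal megabytes

print(f"Approximate weight size: {weight_mb:.2f} MB")  # prints 14.76 MB
```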
Dataset: [ealvaradob/phishing-dataset](https://huggingface.co/datasets/ealvaradob/phishing-dataset) (`urls.json` only)

Example:

    from transformers import BertTokenizerFast, BertForSequenceClassification, pipeline
    import torch

    # Use the GPU when one is available, otherwise fall back to the CPU
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Device in use: {device}")

    model_name = "CrabInHoney/urlbert-tiny-v3-phishing-classifier"

    tokenizer = BertTokenizerFast.from_pretrained(model_name)
    model = BertForSequenceClassification.from_pretrained(model_name)
    model.to(device)

    classifier = pipeline(
        "text-classification",
        model=model,
        tokenizer=tokenizer,
        device=0 if torch.cuda.is_available() else -1,
        return_all_scores=True  # return the score of every class, not only the top one
    )

    test_urls = [
        "huggingface.co/",
        "hu991ngface.com.ru/"
    ]

    # Map the raw model labels to readable class names
    label_mapping = {"LABEL_0": "good", "LABEL_1": "phishing"}

    for url in test_urls:
        results = classifier(url)
        print(f"\nURL: {url}")
        for result in results[0]:
            label = result['label']
            score = result['score']
            friendly_label = label_mapping.get(label, label)
            print(f"Class: {friendly_label}, probability: {score:.4f}")

Output:

    Device in use: cuda

    URL: huggingface.co/
    Class: good, probability: 0.9723
    Class: phishing, probability: 0.0277

    URL: hu991ngface.com.ru/
    Class: good, probability: 0.0070
    Class: phishing, probability: 0.9930
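
If a single verdict per URL is more convenient than the full score list, the per-class scores can be reduced with a small helper. This is a minimal sketch, not part of the model: the function name `phishing_verdict` and the 0.5 threshold are assumptions chosen for illustration, and the input is expected in the same shape as `results[0]` in the example above (a list of `{"label": ..., "score": ...}` dictionaries).

```python
from typing import List

PHISHING_LABEL = "LABEL_1"  # LABEL_1 is the phishing class per the label mapping above


def phishing_verdict(scores: List[dict], threshold: float = 0.5) -> str:
    """Return 'phishing' if the LABEL_1 score reaches the threshold, else 'good'.

    The threshold is a hypothetical default; tune it for your own use case.
    """
    by_label = {item["label"]: item["score"] for item in scores}
    return "phishing" if by_label.get(PHISHING_LABEL, 0.0) >= threshold else "good"


# Scores copied from the sample output above for "hu991ngface.com.ru/"
sample = [
    {"label": "LABEL_0", "score": 0.0070},
    {"label": "LABEL_1", "score": 0.9930},
]
print(phishing_verdict(sample))  # prints "phishing"
```

Raising the threshold trades recall for precision: fewer legitimate URLs are flagged, at the cost of letting more phishing URLs through.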