---
license: mit
base_model: xlm-roberta-base
tags:
- generated_from_trainer
- NER
- crypto
metrics:
- f1
model-index:
- name: xlm-roberta-base-finetuned-ner-crypto
  results: []
widget:
- text: "Didn't I tell you that that was a decent entry point on $PROPHET? If you are in - congrats, Prophet is up 90% in the last 2 weeks and 50% up in the last week alone"
pipeline_tag: token-classification
---

# cryptoNER

This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on a custom cryptocurrency NER dataset of synthetic tweets and ERC20 token metadata (described under Training and evaluation data below).
It achieves the following results on the evaluation set:
- Loss: 0.0058
- F1: 0.9970

## Model description

This model is a fine-tuned version of xlm-roberta-base that specializes in Named Entity Recognition (NER) in the cryptocurrency domain. It is optimized to recognize and classify entities such as cryptocurrency TICKER SYMBOL, token NAME, and blockscanner ADDRESS in text.

## Intended uses

Designed primarily for NER tasks in the cryptocurrency sector, the model identifies and categorizes ticker symbols, token names, and blockscanner addresses in textual content, as in the sketch below.
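
A minimal inference sketch using the Transformers `pipeline` API follows. The model id is assumed from this card's `model-index` name; adjust it to the actual repository path.

```python
from transformers import pipeline

# Load the fine-tuned NER model. The repository id is assumed from the
# model-index name in this card and may need adjusting.
ner = pipeline(
    "token-classification",
    model="xlm-roberta-base-finetuned-ner-crypto",
    aggregation_strategy="simple",  # merge sub-word tokens into whole entities
)

text = (
    "Didn't I tell you that that was a decent entry point on $PROPHET? "
    "Prophet is up 90% in the last 2 weeks."
)
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```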


## Limitations

Performance may degrade on entities that are absent from the training data or occur infrequently in the cryptocurrency domain. The model may also be sensitive to variations in how entities are written and to their surrounding context.

## Training and evaluation data

The model was trained on a diverse dataset combining artificially generated tweets with ERC20 token metadata fetched through the [Covalent Unified API](https://www.covalenthq.com/docs/unified-api/). GPT was used to generate 500 synthetic tweets tailored to the cryptocurrency domain, and the Covalent API supplied 20K+ unique ERC20 token metadata entries, broadening the model's coverage of cryptocurrency entities. A sketch of the metadata-collection step is shown below.
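
The sketch below illustrates what the metadata-collection step could look like. It is a hypothetical example: the endpoint path, pagination parameter, and response field names are assumptions based on Covalent Unified API conventions, not the exact calls used to build this dataset.

```python
import requests

API_KEY = "YOUR_COVALENT_API_KEY"  # placeholder credential
BASE_URL = "https://api.covalenthq.com/v1"

def fetch_token_metadata_page(chain_id: int = 1, page: int = 0) -> list[dict]:
    """Fetch one page of ERC20 token metadata (name, ticker, contract address).

    The endpoint path and field names below are assumptions; consult the
    Covalent docs for the exact request to use in practice.
    """
    resp = requests.get(
        f"{BASE_URL}/{chain_id}/tokens/tokenlists/all/",  # assumed endpoint
        params={"page-number": page},
        auth=(API_KEY, ""),  # Covalent accepts the API key via HTTP Basic auth
    )
    resp.raise_for_status()
    items = resp.json()["data"]["items"]
    return [
        {
            "name": item.get("contract_name"),
            "ticker": item.get("contract_ticker_symbol"),
            "address": item.get("contract_address"),
        }
        for item in items
    ]
```
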
## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (mirrored in the sketch after this list):
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 6
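
These settings correspond to a standard Transformers `TrainingArguments` configuration along the lines of the sketch below; anything not listed above is assumed to be left at its library default.

```python
from transformers import TrainingArguments

# Sketch of a TrainingArguments setup matching the hyperparameters above.
# Values not reported in this card are left at their Transformers defaults.
training_args = TrainingArguments(
    output_dir="xlm-roberta-base-finetuned-ner-crypto",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=6,
    evaluation_strategy="epoch",  # assumption: the table reports per-epoch eval
)
```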

### Training results

| Training Loss | Epoch | Step | Validation Loss | F1     |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 0.0269        | 1.0   | 750  | 0.0080          | 0.9957 |
| 0.0049        | 2.0   | 1500 | 0.0074          | 0.9960 |
| 0.0042        | 3.0   | 2250 | 0.0074          | 0.9965 |
| 0.0034        | 4.0   | 3000 | 0.0058          | 0.9971 |
| 0.0028        | 5.0   | 3750 | 0.0059          | 0.9971 |
| 0.0024        | 6.0   | 4500 | 0.0058          | 0.9970 |


### Framework versions

- Transformers 4.34.1
- Pytorch 2.1.0+cu118
- Datasets 2.14.6
- Tokenizers 0.14.1