|
--- |
|
tags: |
|
- generated_from_trainer |
|
model-index: |
|
- name: EUBERT |
|
  results: []
|
language: |
|
- bg |
|
- cs |
|
- da |
|
- de |
|
- el |
|
- en |
|
- es |
|
- et |
|
- fi |
|
- fr |
|
- ga |
|
- hr |
|
- hu |
|
- it |
|
- lt |
|
- lv |
|
- mt |
|
- nl |
|
- pl |
|
- pt |
|
- ro |
|
- sk |
|
- sl |
|
- sv |
|
--- |
|
|
|
|
|
|
|
|
## Model Card: EUBERT |
|
|
|
### Overview |
|
|
|
- **Model Name**: EUBERT |
|
- **Model Version**: 1.0 |
|
- **Date of Release**: 02 October 2023 |
|
- **Model Architecture**: BERT (Bidirectional Encoder Representations from Transformers) |
|
- **Training Data**: Documents registered by the European Publications Office |
|
- **Model Use Case**: Text Classification, Question Answering, Language Understanding |
|
|
|
![EUBERT](https://huggingface.co/EuropeanParliament/EUBERT/resolve/main/EUBERT_small.png) |
|
|
|
|
|
### Model Description |
|
|
|
EUBERT is a pretrained BERT uncased model that has been trained on a vast corpus of documents registered by the [European Publications Office](https://op.europa.eu/). |
|
These documents span the last 30 years, providing a comprehensive dataset that encompasses a wide range of topics and domains. |
|
EUBERT is designed as a versatile language model that can be fine-tuned for a wide range of natural language processing tasks and applications.
|
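As a quick check, the pretrained checkpoint can be queried with the `transformers` fill-mask pipeline. The snippet below is a minimal sketch: the repository id `EuropeanParliament/EUBERT` is inferred from the image URL above, and it assumes the uploaded checkpoint includes a masked-language-modeling head.

```python
from transformers import pipeline

# Repository id inferred from the image URL above (assumption, not confirmed by the card).
fill_mask = pipeline("fill-mask", model="EuropeanParliament/EUBERT")

# The model is uncased, so the prompt is lowercased. Use the tokenizer's own mask
# token rather than hard-coding [MASK] or <mask>.
prompt = f"the european parliament adopted the {fill_mask.tokenizer.mask_token} on data protection."
for prediction in fill_mask(prompt):
    print(prediction["token_str"], round(prediction["score"], 3))
```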
|
|
### Intended Use |
|
|
|
EUBERT serves as a starting point for building more specific natural language understanding models. |
|
Its versatility makes it suitable for a wide range of tasks, including but not limited to: |
|
|
|
1. **Text Classification**: EUBERT can be fine-tuned to classify text documents into different categories, making it useful for applications such as sentiment analysis, topic categorization, and spam detection (see the fine-tuning sketch after this list).
|
|
|
2. **Question Answering**: By fine-tuning EUBERT on question-answering datasets, it can be used to extract answers from text documents, facilitating tasks like information retrieval and document summarization. |
|
|
|
3. **Language Understanding**: EUBERT can be employed for general language understanding tasks, including named entity recognition, part-of-speech tagging, and text generation. |
|
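The sketch below illustrates the text-classification case with the `transformers` `Trainer`. It is an example only: the repository id is inferred from the image URL above, and the dataset (`imdb`), label count, and output directory are placeholders rather than anything stated in this card.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "EuropeanParliament/EUBERT"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Any labelled dataset with "text" and "label" columns works; IMDB is only a placeholder.
dataset = load_dataset("imdb")
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
                      batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="eubert-classifier",
                           per_device_train_batch_size=32,
                           num_train_epochs=1),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```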
|
|
### Performance |
|
|
|
The specific performance metrics of EUBERT may vary depending on the downstream task and the quality and quantity of training data used for fine-tuning. |
|
Users are encouraged to fine-tune the model on their specific task and evaluate its performance accordingly. |
|
|
|
### Considerations |
|
|
|
- **Data Privacy and Compliance**: Users should ensure that the use of EUBERT complies with all relevant data privacy and compliance regulations, especially when working with sensitive or personally identifiable information. |
|
|
|
- **Fine-Tuning**: The effectiveness of EUBERT on a given task depends on the quality and quantity of the training data, as well as the fine-tuning process. Careful experimentation and evaluation are essential to achieve optimal results. |
|
|
|
- **Bias and Fairness**: Users should be aware of potential biases in the training data and take appropriate measures to mitigate bias when fine-tuning EUBERT for specific tasks. |
|
|
|
### Conclusion |
|
|
|
EUBERT is a pretrained BERT model that leverages a substantial corpus of documents from the European Publications Office. It offers a versatile foundation for developing natural language processing solutions across a wide range of applications, enabling researchers and developers to create custom models for text classification, question answering, and language understanding tasks. Users are encouraged to exercise diligence in fine-tuning and evaluating the model for their specific use cases while adhering to data privacy and fairness considerations. |
|
|
|
|
|
--- |
|
|
|
## Training procedure |
|
|
|
A dedicated byte-level BPE tokenizer was trained for this model, with a vocabulary size of 2**16 (65,536) and a minimum token frequency of 2.
|
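A tokenizer with these settings can be trained with the `tokenizers` library. The sketch below is illustrative rather than the exact script used; the corpus path and special-token list are assumptions.

```python
import os

from tokenizers import ByteLevelBPETokenizer

# Byte-level BPE tokenizer with the settings listed above.
tokenizer = ByteLevelBPETokenizer(lowercase=True)  # the card describes the model as uncased
tokenizer.train(
    files=["corpus.txt"],  # placeholder path to the raw text corpus
    vocab_size=2**16,      # 65,536 entries
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
os.makedirs("eubert-tokenizer", exist_ok=True)
tokenizer.save_model("eubert-tokenizer")  # writes vocab.json and merges.txt
```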
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training; a `TrainingArguments` sketch expressing them follows the list:
|
- learning_rate: 5e-05 |
|
- train_batch_size: 32 |
|
- eval_batch_size: 32 |
|
- seed: 42 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: linear |
|
- num_epochs: 1 |
|
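These values map directly onto `transformers.TrainingArguments`; the snippet below only illustrates that mapping (the output directory is a placeholder) and is not the original training script.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="eubert-pretraining",  # placeholder
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=1,
)
```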
|
|
### Training results |
|
|
|
Coming soon |
|
|
|
### Framework versions |
|
|
|
- Transformers 4.33.3 |
|
- Pytorch 2.0.1+cu117 |
|
- Datasets 2.14.5 |
|
- Tokenizers 0.13.3 |
|
|
|
### Infrastructure |
|
|
|
- **Hardware Type:** 4 x GPUs (24 GB each)
|
- **GPU Days:** 16 |
|
- **Cloud Provider:** EuroHPC |
|
- **Compute Region:** Meluxina |
|
|
|
|
|
## Model Card Authors
|
|
|
Sebastien Campion |
|
|
|
## Model Card Contact
|
|
|
[email protected] |
|
|