|
--- |
|
tags: |
|
- generated_from_trainer |
|
model-index: |
|
- name: EUBERT |
|
  results: []
|
language: |
|
- bg |
|
- cs |
|
- da |
|
- de |
|
- el |
|
- en |
|
- es |
|
- et |
|
- fi |
|
- fr |
|
- ga |
|
- hr |
|
- hu |
|
- it |
|
- lt |
|
- lv |
|
- mt |
|
- nl |
|
- pl |
|
- pt |
|
- ro |
|
- sk |
|
- sl |
|
- sv |
|
--- |
|
|
|
|
|
|
|
|
## Model Card: EUBERT |
|
|
|
### Overview |
|
|
|
- **Model Name**: EUBERT |
|
- **Model Version**: 1.0 |
|
- **Date of Release**: 02 October 2023 |
|
- **Model Architecture**: BERT (Bidirectional Encoder Representations from Transformers) |
|
- **Training Data**: Documents registered by the European Publications Office |
|
- **Model Use Case**: Text Classification, Question Answering, Language Understanding |
|
|
|
![EUBERT](https://huggingface.co/EuropeanParliament/EUBERT/resolve/main/EUBERT_small.png) |
|
|
|
|
|
### Model Description |
|
|
|
EUBERT is a pretrained BERT uncased model that has been trained on a vast corpus of documents registered by the [European Publications Office](https://op.europa.eu/). |
|
These documents span the last 30 years, providing a comprehensive dataset that encompasses a wide range of topics and domains. |
|
EUBERT is designed as a versatile language model that can be fine-tuned for a wide range of natural language processing tasks and applications.
|
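As a quick check, the pretrained checkpoint can be queried with the `transformers` fill-mask pipeline. The snippet below is a minimal sketch: the repository id `EuropeanParliament/EUBERT` is inferred from the image URL above, and it assumes the uploaded checkpoint includes a masked-language-modeling head.

```python
from transformers import pipeline

# Repository id inferred from the image URL above (assumption, not confirmed by the card).
fill_mask = pipeline("fill-mask", model="EuropeanParliament/EUBERT")

# The model is uncased, so the prompt is lowercased. Use the tokenizer's own mask
# token rather than hard-coding [MASK] or <mask>.
prompt = f"the european parliament adopted the {fill_mask.tokenizer.mask_token} on data protection."
for prediction in fill_mask(prompt):
    print(prediction["token_str"], round(prediction["score"], 3))
```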
|
|
### Intended Use |
|
|
|
EUBERT serves as a starting point for building more specific natural language understanding models. |
|
Its versatility makes it suitable for a wide range of tasks, including but not limited to: |
|
|
|
1. **Text Classification**: EUBERT can be fine-tuned to classify text documents into different categories, making it useful for applications such as sentiment analysis, topic categorization, and spam detection (see the fine-tuning sketch after this list).
|
|
|
2. **Question Answering**: By fine-tuning EUBERT on question-answering datasets, it can be used to extract answers from text documents, facilitating tasks like information retrieval and document summarization. |
|
|
|
3. **Language Understanding**: EUBERT can be employed for general language understanding tasks, including named entity recognition, part-of-speech tagging, and text generation. |
|
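The sketch below illustrates the text-classification case with the `transformers` `Trainer`. It is an example only: the repository id is inferred from the image URL above, and the dataset (`imdb`), label count, and output directory are placeholders rather than anything stated in this card.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "EuropeanParliament/EUBERT"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Any labelled dataset with "text" and "label" columns works; IMDB is only a placeholder.
dataset = load_dataset("imdb")
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
                      batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="eubert-classifier",
                           per_device_train_batch_size=32,
                           num_train_epochs=1),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```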
|
|
### Performance |
|
|
|
The specific performance metrics of EUBERT may vary depending on the downstream task and the quality and quantity of training data used for fine-tuning. |
|
Users are encouraged to fine-tune the model on their specific task and evaluate its performance accordingly. |
|
|
|
### Considerations |
|
|
|
- **Data Privacy and Compliance**: Users should ensure that the use of EUBERT complies with all relevant data privacy and compliance regulations, especially when working with sensitive or personally identifiable information. |
|
|
|
- **Fine-Tuning**: The effectiveness of EUBERT on a given task depends on the quality and quantity of the training data, as well as the fine-tuning process. Careful experimentation and evaluation are essential to achieve optimal results. |
|
|
|
- **Bias and Fairness**: Users should be aware of potential biases in the training data and take appropriate measures to mitigate bias when fine-tuning EUBERT for specific tasks. |
|
|
|
### Conclusion |
|
|
|
EUBERT is a pretrained BERT model that leverages a substantial corpus of documents from the European Publications Office. It offers a versatile foundation for developing natural language processing solutions across a wide range of applications, enabling researchers and developers to create custom models for text classification, question answering, and language understanding tasks. Users are encouraged to exercise diligence in fine-tuning and evaluating the model for their specific use cases while adhering to data privacy and fairness considerations. |
|
|
|
|
|
--- |
|
|
|
## Training procedure |
|
|
|
A dedicated byte-level BPE tokenizer was trained for this model, with a vocabulary size of 2**16 (65,536) and a minimum token frequency of 2.
|
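A tokenizer with these settings can be trained with the `tokenizers` library. The sketch below is illustrative rather than the exact script used; the corpus path and special-token list are assumptions.

```python
import os

from tokenizers import ByteLevelBPETokenizer

# Byte-level BPE tokenizer with the settings listed above.
tokenizer = ByteLevelBPETokenizer(lowercase=True)  # the card describes the model as uncased
tokenizer.train(
    files=["corpus.txt"],  # placeholder path to the raw text corpus
    vocab_size=2**16,      # 65,536 entries
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
os.makedirs("eubert-tokenizer", exist_ok=True)
tokenizer.save_model("eubert-tokenizer")  # writes vocab.json and merges.txt
```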
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training; a `TrainingArguments` sketch expressing them follows the list:
|
- learning_rate: 5e-05 |
|
- train_batch_size: 32 |
|
- eval_batch_size: 32 |
|
- seed: 42 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: linear |
|
- num_epochs: 1 |
|
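These values map directly onto `transformers.TrainingArguments`; the snippet below only illustrates that mapping (the output directory is a placeholder) and is not the original training script.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="eubert-pretraining",  # placeholder
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=1,
)
```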
|
|
### Training results |
|
|
|
Coming soon |
|
|
|
### Framework versions |
|
|
|
- Transformers 4.33.3 |
|
- Pytorch 2.0.1+cu117 |
|
- Datasets 2.14.5 |
|
- Tokenizers 0.13.3 |
|
|
|
### Infrastructure |
|
|
|
- **Hardware Type:** 4 x GPUs (24 GB each)
|
- **GPU Days:** 16 |
|
- **Cloud Provider:** EuroHPC |
|
- **Compute Region:** Meluxina |
|
|
|
|
|
## Model Card Authors
|
|
|
Sebastien Campion |
|
|
|
## Model Card Contact
|
|
|
[email protected] |
|
|