
	---
	language: es
	tags:
	- sagemaker
	- bertin
	- TextClassification
	- SentimentAnalysis
	license: apache-2.0
	datasets:
	- IMDbreviews_es
	metrics:
	- accuracy
	model-index:
	- name: bertin_base_sentiment_analysis_es
	  results:
	  - task:
	      name: Sentiment Analysis
	      type: sentiment-analysis
	    dataset:
	      name: "IMDb Reviews in Spanish"
	      type: IMDbreviews_es
	    metrics:
	    - name: Accuracy
	      type: accuracy
	      value: 0.898933
	    - name: F1 Score
	      type: f1
	      value: 0.8989063
	    - name: Precision
	      type: precision
	      value: 0.8771473
	    - name: Recall
	      type: recall
	      value: 0.9217724
	widget:
	- text: "Se trata de una película interesante, con un solido argumento y un gran interpretación de su actor principal"
	---

	## Model `bertin_base_sentiment_analysis_es`

	### A fine-tuned model for sentiment analysis in Spanish

	This model was trained using Amazon SageMaker and the new Hugging Face Deep Learning container.
	The base model is Bertin base, a RoBERTa-base model pre-trained on the Spanish portion of mC4 using Flax.
	The base model was trained by the Bertin Project. [Link to base model](https://huggingface.co/bertin-project/bertin-roberta-base-spanish)

	Article: BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling

	- Authors: Javier De la Rosa, Eduardo G. Ponferrada, Manu Romero, Paulo Villegas, Pablo González de Prado Salas, and María Grandury
	- Journal: Procesamiento del Lenguaje Natural, volume 68, number 0, 2022
	- URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403

	## Dataset
	The dataset is a collection of about 50,000 movie reviews. It is balanced, and it provides every review in both English and Spanish, along with the label in both languages.

	Sizes of datasets:
	- Train dataset: 42,500
	- Validation dataset: 3,750
	- Test dataset: 3,750
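The split sizes above add up to the roughly 50,000 reviews mentioned earlier. As a quick sanity check, the following sketch computes the total and the share of each split. The dictionary name `splits` is only illustrative; the numbers are the ones listed above.

```python
# Split sizes as reported in this model card
splits = {"train": 42_500, "validation": 3_750, "test": 3_750}

total = sum(splits.values())

# Fraction of the full dataset assigned to each split
fractions = {name: size / total for name, size in splits.items()}

for name, size in splits.items():
    print(f"{name:<10} {size:>6}  {fractions[name]:.1%}")
print(f"{'total':<10} {total:>6}")
```

This corresponds to an 85% / 7.5% / 7.5% train / validation / test split.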

	## Intended uses & limitations

	This model is intended for sentiment analysis of Spanish text. It was fine-tuned specifically on movie reviews, but it can also be applied to other kinds of reviews.

	## Hyperparameters
	{
	"epochs": "4",
	"train_batch_size": "32",
	"eval_batch_size": "8",
	"fp16": "true",
	"learning_rate": "3e-05",
	"model_name": "\"bertin-project/bertin-roberta-base-spanish\"",
	"sagemaker_container_log_level": "20",
	"sagemaker_program": "\"train.py\""
	}
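The hyperparameters above are listed with every value as a string, and the string values carry their own escaped quotes. That suggests each value is a JSON-encoded string, as is typical for SageMaker training jobs. Under that assumption, a minimal sketch for recovering typed Python values could look like this (the variable names are illustrative, not taken from the original training script):

```python
import json

# Hyperparameters as listed in this model card (every value is a string)
raw_hyperparameters = {
    "epochs": "4",
    "train_batch_size": "32",
    "eval_batch_size": "8",
    "fp16": "true",
    "learning_rate": "3e-05",
    "model_name": "\"bertin-project/bertin-roberta-base-spanish\"",
    "sagemaker_container_log_level": "20",
    "sagemaker_program": "\"train.py\"",
}

# Assumption: each value is itself a small JSON document,
# so json.loads turns it back into an int, float, bool, or str.
hyperparameters = {
    key: json.loads(value) for key, value in raw_hyperparameters.items()
}

for key, value in hyperparameters.items():
    print(f"{key}: {value!r} ({type(value).__name__})")
```

After decoding, `epochs` is an integer, `fp16` is a boolean, `learning_rate` is a float, and `model_name` is a plain string without the extra quotes.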

	## Evaluation results
	- Accuracy = 0.8989333333333334
	- F1 Score = 0.8989063750333421
	- Precision = 0.877147319104633
	- Recall = 0.9217724288840262
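The F1 score is the harmonic mean of precision and recall, so the reported values can be cross-checked against each other. The sketch below does that with the numbers listed above; the function name `f1_from_precision_recall` is only illustrative.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values reported in this model card
precision = 0.877147319104633
recall = 0.9217724288840262
reported_f1 = 0.8989063750333421

computed_f1 = f1_from_precision_recall(precision, recall)
print(f"computed F1 = {computed_f1:.7f}")
print(f"reported F1 = {reported_f1:.7f}")
```

The computed value agrees with the reported F1 score, which indicates the three metrics come from the same set of predictions.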

	## Test results

	## Model in action

	### Usage for Sentiment Analysis

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	model_id = "edumunozsala/bertin_base_sentiment_analysis_es"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForSequenceClassification.from_pretrained(model_id)

	text = "Se trata de una película interesante, con un solido argumento y un gran interpretación de su actor principal"

	# Tokenize into a batch of one, truncating reviews longer than the model's maximum length
	inputs = tokenizer(text, return_tensors="pt", truncation=True)

	# Gradients are not needed for inference
	with torch.no_grad():
	    outputs = model(**inputs)

	# Index of the highest-scoring class
	output = outputs.logits.argmax(dim=1)
	```
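The usage snippet above returns only the index of the highest-scoring class. To get a confidence score as well, the logits can be passed through a softmax. The sketch below shows the idea in plain Python with hypothetical logits (the values in `example_logits` are made up for illustration and are not real model output). With PyTorch, the equivalent would be `torch.softmax(outputs.logits, dim=1)`. This card does not state which index corresponds to which sentiment, so check `model.config.id2label` before interpreting the result.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a two-class model (not real model output)
example_logits = [-1.2, 2.3]

probabilities = softmax(example_logits)

# Same result as argmax over the logits, since softmax preserves ordering
predicted_index = max(range(len(probabilities)), key=probabilities.__getitem__)

print(f"probabilities   = {[round(p, 4) for p in probabilities]}")
print(f"predicted index = {predicted_index}")
```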

	Created by [Eduardo Muñoz/@edumunozsala](https://github.com/edumunozsala)