---
license: mit
language: multilingual
library_name: torch
tags: []
base_model: BAAI/bge-m3
datasets:
- philipp-zettl/GGU-xx
- philipp-zettl/sentiment
metrics:
- accuracy
- precision
- recall
- f1-score
model_name: Multi-Head Sequence Classification Model
pipeline_tag: text-classification
widget:
- text: "Hello, how are you?"
  label: "[GGU] Greeting"
- text: "Thank you for your help"
  label: "[GGU] Gratitude"
- text: "Hallo, wie geht es dir?"
  label: "[GGU] Greeting (de)"
- text: "Danke dir."
  label: "[GGU] Gratitude (de)"
- text: "I am not sure what you mean"
  label: "[GGU] Other"
- text: "Generate me an image of a dog!"
  label: "[GGU] Other"
- text: "What is the weather like today?"
  label: "[GGU] Other"
- text: "Wie ist das Wetter heute?"
  label: "[GGU] Other (de)"
---

# Multi-Head Sequence Classification Model

## Model description

The model is a simple sequence classification model based on the hidden output layers of a pre-trained transformer model. Multiple heads are added to the output of the backbone to classify the input sequence.

### Model architecture

The backbone of the model is BAAI/bge-m3 with 1024 output dimensions. An additional classification layer per head (GGU: 3 outputs, sentiment: 3 outputs) is added to the output of the backbone to classify the input sequence.

The label mapping for each head is:

**GGU**
- 0: Greeting
- 1: Gratitude
- 2: Other

**sentiment**
- 0: Positive
- 1: Negative
- 2: Neutral

The joint architecture was trained using the implementation of `MultiHeadClassificationTrainer` provided in this repository.

### Use cases

Use cases: text classification, sentiment analysis.

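To make the label mapping listed under "Model architecture" concrete, here is a minimal sketch that turns per-head class scores into label names. It assumes the scores are available as a dictionary keyed by head name with one row of scores per input text (for example after calling `.tolist()` on each head's output tensor). The helper name `decode_predictions` and the example scores are hypothetical and not part of the repository.

```python
# Label maps copied from the mapping listed in this model card.
LABEL_MAPS = {
    "GGU": {0: "Greeting", 1: "Gratitude", 2: "Other"},
    "sentiment": {0: "Positive", 1: "Negative", 2: "Neutral"},
}


def decode_predictions(scores_by_head, label_maps=LABEL_MAPS):
    """Map each row of class scores to a label name via argmax.

    Assumption: `scores_by_head` looks like {"GGU": [[s0, s1, s2], ...], ...},
    with one row of scores per input text.
    """
    decoded = {}
    for head, rows in scores_by_head.items():
        names = label_maps[head]
        decoded[head] = [
            names[max(range(len(row)), key=row.__getitem__)] for row in rows
        ]
    return decoded


# Made-up scores for two input texts, for illustration only.
example = {
    "GGU": [[2.1, -0.3, 0.4], [-1.0, 0.2, 3.5]],
    "sentiment": [[0.1, 0.2, 1.7], [0.9, -0.5, 0.3]],
}
print(decode_predictions(example))
# {'GGU': ['Greeting', 'Other'], 'sentiment': ['Neutral', 'Positive']}
```
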
## Model Inference

Inference code:

```python
from transformers import AutoModel, AutoTokenizer
# MultiHeadSequenceClassificationModel is defined in this repository's model module.
from .model import MultiHeadSequenceClassificationModel
import torch

model = MultiHeadSequenceClassificationModel.from_pretrained('philipp-zettl/multi-head-sequence-classification-model')
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-m3')


def predict(text):
    inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True)
    # Gradients are not needed for inference.
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs
```

## Model Training

#### Confusion Matrix

**GGU**

![Confusion Matrix GGU](assets/confusion_matrix_GGU.png)

**sentiment**

![Confusion Matrix sentiment](assets/confusion_matrix_sentiment.png)

#### Training Loss

**GGU**

![Loss GGU](assets/loss_plot_GGU.png)

**sentiment**

![Loss sentiment](assets/loss_plot_sentiment.png)

### Training data

The model has been trained on the following datasets:

- [philipp-zettl/GGU-xx](https://huggingface.co/datasets/philipp-zettl/GGU-xx)
- [philipp-zettl/sentiment](https://huggingface.co/datasets/philipp-zettl/sentiment)

Training used the implementation provided by `MultiHeadClassificationTrainer`.

### Training procedure

The following code has been executed to train the model:

```python
from copy import deepcopy

import torch
from transformers import AutoModel, AutoTokenizer

# MultiHeadClassificationTrainer is the trainer implementation provided in this repository.


def train_classifier():
    backbone = AutoModel.from_pretrained('BAAI/bge-m3').to(torch.float16)
    tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-m3')
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    ggu_label_map = {
        0: 'Greeting',
        1: 'Gratitude',
        2: 'Other'
    }
    sentiment_label_map = {
        0: 'Positive',
        1: 'Negative',
        2: 'Neutral'
    }
    num_labels = len(ggu_label_map)

    # HParams
    dropout = 0.25
    learning_rate = 3e-5
    momentum = 0.9
    l2_reg = 0.25
    l2_loss_weight = 0.25

    model_conf = {
        'backbone': backbone,
        'head_config': {
            'GGU': num_labels,
        },
        'dropout': dropout,
        'l2_reg': l2_reg,
    }
    optimizer_conf = {
        'lr': learning_rate,
        'momentum': momentum
    }
    scheduler_conf = {
        'factor': 0.2,
        'patience': 3,
        'min_lr': 1e-8
    }
    train_run = 1000

    trainer = MultiHeadClassificationTrainer(
        model_conf=model_conf,
        # Note: this overrides the learning rate from optimizer_conf (3e-5) with 1e-4.
        optimizer_conf={**optimizer_conf, 'lr': 1e-4},
        scheduler_conf=scheduler_conf,
        num_epochs=35,
        l2_loss_weight=l2_loss_weight,
        use_lr_scheduler=True,
        train_run=train_run,
        auto_find_batch_size=False
    )

    # Stage 1: train and evaluate the GGU head.
    new_model, history = trainer.train(dataset_name='philipp-zettl/GGU-xx', target_heads=['GGU'])
    metrics = history['metrics']
    history['loss_plot'] = trainer._plot_history(**metrics)
    res = trainer.eval({'GGU': ggu_label_map})
    history['evaluation'] = res['GGU']

    total_history = {
        'GGU': deepcopy(history),
    }

    # Stage 2: add a sentiment head with 3 classes, then train and evaluate it.
    trainer.classifier.add_head('sentiment', 3)
    trainer.auto_find_batch_size = False
    new_model, history = trainer.train(dataset_name='philipp-zettl/sentiment', target_heads=['sentiment'], sample_key='text', num_epochs=10, lr=1e-4)
    metrics = history['metrics']
    history['loss_plot'] = trainer._plot_history(**metrics)
    res = trainer.eval({'sentiment': sentiment_label_map}, sample_key='text')
    history['evaluation'] = res['sentiment']

    total_history['sentiment'] = deepcopy(history)

    label_maps = {
        'GGU': ggu_label_map,
        'sentiment': sentiment_label_map,
    }

    return new_model, total_history, trainer, label_maps
```

### Evaluation

### Evaluation data

For model evaluation, a 20% validation split of the training data was used.

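As an illustration of what a 20% validation split means, the sketch below holds out a random fifth of a list of samples. This is only one possible way to do it: how `MultiHeadClassificationTrainer` actually creates the split is not shown in this card, and the helper name `train_validation_split`, the fixed seed, and the shuffling strategy are all assumptions.

```python
import random


def train_validation_split(samples, validation_fraction=0.2, seed=42):
    """Shuffle sample indices with a fixed seed and hold out a fraction for validation.

    Hypothetical helper for illustration; not the repository's implementation.
    """
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(samples) * validation_fraction))
    val_indices = set(indices[:n_val])
    train = [s for i, s in enumerate(samples) if i not in val_indices]
    validation = [s for i, s in enumerate(samples) if i in val_indices]
    return train, validation


samples = list(range(100))
train, validation = train_validation_split(samples)
print(len(train), len(validation))  # 80 20
```
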
### Evaluation procedure

The model was evaluated using the `eval` method provided by the `MultiHeadClassificationTrainer` class. Its `_eval_model` method is shown below:

```python
import pandas as pd
import torch
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    recall_score,
)
from tqdm import tqdm
from transformers import BatchEncoding


def _eval_model(self, dataloader, label_map, sample_key, label_key):
    # Switch to evaluation mode (disables dropout).
    self.classifier.train(False)
    eval_heads = list(label_map.keys())
    y_pred = {h: [] for h in eval_heads}
    y_test = {h: [] for h in eval_heads}

    for sample in tqdm(dataloader, total=len(dataloader), desc='Evaluating model...'):
        labels = {name: sample[label_key] for name in eval_heads}
        # Every key except the label and the raw text is a tokenizer output.
        embeddings = BatchEncoding({k: torch.stack(v, dim=1).to(self.device) for k, v in sample.items() if k not in [label_key, sample_key]})
        # The inputs are already on self.device, so no hard-coded 'cuda' transfer is needed.
        output = self.classifier(embeddings, head_names=eval_heads)
        for head in eval_heads:
            y_pred[head].extend(output[head].argmax(dim=1).cpu())
            y_test[head].extend(labels[head])
        torch.cuda.empty_cache()

    # Macro averaging gives every class the same weight, regardless of support.
    accuracies = {h: accuracy_score(y_test[h], y_pred[h]) for h in eval_heads}
    f1_scores = {h: f1_score(y_test[h], y_pred[h], average="macro") for h in eval_heads}
    recalls = {h: recall_score(y_test[h], y_pred[h], average='macro') for h in eval_heads}

    report = {}
    for head in eval_heads:
        cm = confusion_matrix(y_test[head], y_pred[head], labels=list(label_map[head].keys()))
        disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=list(label_map[head].values()))
        clf_report = classification_report(
            y_test[head], y_pred[head], output_dict=True, target_names=list(label_map[head].values())
        )
        # Accuracy is reported separately in `metrics`.
        del clf_report["accuracy"]
        clf_report = pd.DataFrame(clf_report).T.reset_index()
        report[head] = dict(
            clf_report=clf_report,
            confusion_matrix=disp,
            metrics={'accuracy': accuracies[head], 'f1': f1_scores[head], 'recall': recalls[head]}
        )
    return report
```

### Metrics

For evaluation, we used the following metrics: accuracy, precision, recall, f1-score.

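To clarify what the macro-averaged metrics measure, the following minimal sketch computes per-class precision, recall and F1 from true and predicted labels in plain Python and then averages them with equal weight per class. It is a simplified stand-in for the scikit-learn calls used during evaluation; the helper names and the toy labels are made up for illustration.

```python
def per_class_scores(y_true, y_pred, labels):
    """Compute precision, recall and F1 for each class from label lists."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def macro_average(scores, metric):
    """Unweighted mean of a metric over all classes."""
    return sum(s[metric] for s in scores.values()) / len(scores)


# Toy labels, not taken from the evaluation data.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
scores = per_class_scores(y_true, y_pred, labels=[0, 1, 2])
print(round(macro_average(scores, "recall"), 4))  # 0.6667
print(round(macro_average(scores, "f1"), 4))      # 0.6556
```
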
You can find a detailed classification report here:

**GGU:**

| label        | precision | recall   | f1-score | support |
|:-------------|----------:|---------:|---------:|--------:|
| Greeting     | 0.904762  | 0.974359 | 0.938272 | 39      |
| Gratitude    | 0.958333  | 0.851852 | 0.901961 | 27      |
| Other        | 1         | 1        | 1        | 39      |
| macro avg    | 0.954365  | 0.94207  | 0.946744 | 105     |
| weighted avg | 0.953912  | 0.952381 | 0.951862 | 105     |

**sentiment:**

| label        | precision | recall   | f1-score | support |
|:-------------|----------:|---------:|---------:|--------:|
| Positive     | 0.783088  | 0.861878 | 0.820596 | 12851   |
| Negative     | 0.802105  | 0.819524 | 0.810721 | 14229   |
| Neutral      | 0.7874    | 0.6913   | 0.736227 | 13126   |
| macro avg    | 0.790864  | 0.790901 | 0.789181 | 40206   |
| weighted avg | 0.791226  | 0.7912   | 0.789557 | 40206   |
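
The `macro avg` and `weighted avg` rows can be reproduced from the per-class rows. The short sketch below does this for the GGU precision column, using the values from the report above: the macro average is the plain mean over classes, while the weighted average weights each class by its support. The function name `macro_and_weighted` is hypothetical.

```python
# (label, precision, support) taken from the GGU classification report.
GGU_PRECISION = [
    ("Greeting", 0.904762, 39),
    ("Gratitude", 0.958333, 27),
    ("Other", 1.0, 39),
]


def macro_and_weighted(rows):
    """Return (macro average, support-weighted average) for (label, value, support) rows."""
    total_support = sum(support for _, _, support in rows)
    macro = sum(value for _, value, _ in rows) / len(rows)
    weighted = sum(value * support for _, value, support in rows) / total_support
    return macro, weighted


macro, weighted = macro_and_weighted(GGU_PRECISION)
print(round(macro, 6), round(weighted, 6))
```

The results agree with the reported 0.954365 (macro) and 0.953912 (weighted) up to rounding of the per-class inputs.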