---
language: fr
license: mit
datasets:
- Sequoia
widget:
- text: Aucun financement politique occulte n'a pu être mis en évidence.
- text: L'excrétion de l'acide zolédronique dans le lait maternel n'est pas connue.
pipeline_tag: token-classification
tags:
- mwe
---

# Multiword expression recognition

A multiword expression (MWE) is a combination of words that exhibits lexical, morphosyntactic, semantic, pragmatic and/or statistical idiosyncrasies (Baldwin and Kim, 2010). The goal of multiword expression recognition (MWER) is to automatically identify these MWEs in text.

## Model description

`camembert-mwer` is a token classification model fine-tuned from [CamemBERT](https://huggingface.co/camembert/camembert-large) on the [Sequoia](http://deep-sequoia.inria.fr/) dataset for the MWER task.

## How to use

You can use this model directly with a pipeline for token classification:

```python
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("bvantuan/camembert-mwer")
>>> model = AutoModelForTokenClassification.from_pretrained("bvantuan/camembert-mwer")
>>> mwe_classifier = pipeline('token-classification', model=model, tokenizer=tokenizer)
>>> sentence = "Pour ce premier rendez-vous, l'animateur a pu faire partager sa passion et présenter quelques oeuvres pour mettre en bouche les participants."
>>> mwes = mwe_classifier(sentence)
>>> mwes
[{'entity': 'B-MWE', 'score': 0.99492574, 'index': 4, 'word': '▁rendez', 'start': 15, 'end': 22},
 {'entity': 'I-MWE', 'score': 0.9344883, 'index': 5, 'word': '-', 'start': 22, 'end': 23},
 {'entity': 'I-MWE', 'score': 0.99398583, 'index': 6, 'word': 'vous', 'start': 23, 'end': 27},
 {'entity': 'B-VID', 'score': 0.9827843, 'index': 22, 'word': '▁mettre', 'start': 106, 'end': 113},
 {'entity': 'I-VID', 'score': 0.9835186, 'index': 23, 'word': '▁en', 'start': 113, 'end': 116},
 {'entity': 'I-VID', 'score': 0.98324823, 'index': 24, 'word': '▁bouche', 'start': 116, 'end': 123}]

>>> mwe_classifier.group_entities(mwes)
[{'entity_group': 'MWE', 'score': 0.9744666, 'word': 'rendez-vous', 'start': 15, 'end': 27},
 {'entity_group': 'VID', 'score': 0.9831837, 'word': 'mettre en bouche', 'start': 106, 'end': 123}]
```

## Training data

The Sequoia dataset is divided into train/dev/test sets:

|              | Sequoia (total) | train | dev | test |
| :----------: | :-------------: | :---: | :-: | :--: |
| #sentences   | 3099            | 1955  | 273 | 871  |
| #MWEs        | 3450            | 2170  | 306 | 974  |
| #Unseen MWEs | _               | _     | 100 | 300  |

The dataset distinguishes 6 categories of MWEs:

* MWE: non-verbal MWE (e.g. **à peu près**)
* IRV: inherently reflexive verb (e.g. **s'occuper**)
* LVC.cause: causative light-verb construction (e.g. **causer** le **bouleversement**)
* LVC.full: light-verb construction (e.g. **avoir pour but** de)
* MVC: multi-verb construction (e.g. **faire remarquer**)
* VID: verbal idiom (e.g. **voir le jour**)

## Training procedure

### Preprocessing

Sentences are labeled with the Inside–Outside–Beginning (IOB2) tagging scheme: the first token of an MWE receives a `B-` tag, its remaining tokens receive `I-` tags, and all other tokens are tagged `O`. For example, *mettre en bouche* is tagged `B-VID I-VID I-VID`.

### Fine-tuning

The model was fine-tuned on the combined train and dev sets with a learning rate of $3 \times 10^{-5}$, a batch size of 10, and 15 epochs.
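For illustration, a minimal fine-tuning sketch using the Hugging Face `Trainer` is given below. It is not the exact training script used for this model: the label set, the subword/label alignment, and the one-sentence toy dataset standing in for the Sequoia train+dev split are all assumptions.

```python
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

# Assumed IOB2 label set built from the six Sequoia categories plus "O".
categories = ["MWE", "IRV", "LVC.cause", "LVC.full", "MVC", "VID"]
labels = ["O"] + [f"{prefix}-{cat}" for cat in categories for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

tokenizer = AutoTokenizer.from_pretrained("camembert/camembert-large")
model = AutoModelForTokenClassification.from_pretrained(
    "camembert/camembert-large",
    num_labels=len(labels), id2label=id2label, label2id=label2id,
)

# Toy stand-in for the Sequoia train+dev split: pre-tokenized words with IOB2 tags.
raw = Dataset.from_dict({
    "tokens": [["Il", "faut", "mettre", "en", "bouche", "les", "participants", "."]],
    "tags": [["O", "O", "B-VID", "I-VID", "I-VID", "O", "O", "O"]],
})

def tokenize_and_align(example):
    # Align word-level IOB2 tags with CamemBERT subword tokens: label only the
    # first subword of each word and mask the rest (and special tokens) with -100.
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    label_ids, previous_word = [], None
    for word_id in enc.word_ids():
        if word_id is None or word_id == previous_word:
            label_ids.append(-100)
        else:
            label_ids.append(label2id[example["tags"][word_id]])
        previous_word = word_id
    enc["labels"] = label_ids
    return enc

train_dataset = raw.map(tokenize_and_align, remove_columns=["tokens", "tags"])

# Hyperparameters reported above: learning rate 3e-5, batch size 10, 15 epochs.
training_args = TrainingArguments(
    output_dir="camembert-mwer",
    learning_rate=3e-5,
    per_device_train_batch_size=10,
    num_train_epochs=15,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```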
### Evaluation results

On the test set, this model achieves the following results:

|                  | Precision | Recall | F1    |
| :--------------: | :-------: | :----: | :---: |
| Global MWE-based | 83.78     | 83.78  | 83.78 |
| Unseen MWE-based | 57.05     | 60.67  | 58.80 |
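The MWE-based scores count whole expressions rather than individual tokens: a prediction is correct only if both the span and the category match the gold annotation. As a rough illustration (not the original evaluation tooling, and ignoring discontinuous MWEs), span-level scores over IOB2 tag sequences can be computed with the `seqeval` library; the gold and predicted sequences below are made-up examples:

```python
from seqeval.metrics import f1_score, precision_score, recall_score

# Made-up gold and predicted IOB2 sequences for two sentences (illustration only).
y_true = [["O", "B-MWE", "I-MWE", "O"], ["O", "B-VID", "I-VID", "I-VID", "O"]]
y_pred = [["O", "B-MWE", "I-MWE", "O"], ["O", "B-VID", "I-VID", "O", "O"]]

# A span counts as correct only if its boundaries and category both match.
print(precision_score(y_true, y_pred))  # 0.5
print(recall_score(y_true, y_pred))     # 0.5
print(f1_score(y_true, y_pred))         # 0.5
```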