---
license: mit
language:
- multilingual
base_model:
- FacebookAI/xlm-roberta-large
pipeline_tag: token-classification
---

# Multilingual Identification of English Code-Switching

AnE-NER (Any-English Code-Switching Named Entity Recognition) is a token-level model for detecting named entities in code-switched text. It assigns each word one of two labels: `I` (inside a named entity) or `O` (outside a named entity). The model performs well both on languages seen in the training data and on languages unseen in it.

# Usage

You can use AnE-NER with Hugging Face’s `pipeline` or with `AutoModelForTokenClassification`.

Let's try the following example, taken from [this paper](https://aclanthology.org/W18-3213/):

```python
input = "My Facebook, Ig & Twitter is hellaa dead yall Jk soy yo que has no life!"
```

## Pipeline

```python
from transformers import pipeline
classifier = pipeline("token-classification", model="igorsterner/AnE-NER", aggregation_strategy="simple")
result = classifier(input)
```

which returns:

```
[{'entity_group': 'I',
  'score': 0.95482016,
  'word': 'Facebook',
  'start': 3,
  'end': 11},
 {'entity_group': 'I',
  'score': 0.9638739,
  'word': 'Ig',
  'start': 13,
  'end': 15},
 {'entity_group': 'I',
  'score': 0.98207414,
  'word': 'Twitter',
  'start': 18,
  'end': 25}]
```
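
The `start` and `end` fields are character offsets into the input string, so entity strings can be recovered by slicing. The snippet below is a minimal sketch of that idea: `extract_entities` and its `min_score` threshold are hypothetical helpers, not part of the model or of `transformers`, and the `result` list is copied from the output above with scores rounded so the snippet runs without downloading the model.

```python
text = "My Facebook, Ig & Twitter is hellaa dead yall Jk soy yo que has no life!"

# Pipeline output copied from the example above (scores rounded).
result = [
    {"entity_group": "I", "score": 0.9548, "word": "Facebook", "start": 3, "end": 11},
    {"entity_group": "I", "score": 0.9639, "word": "Ig", "start": 13, "end": 15},
    {"entity_group": "I", "score": 0.9821, "word": "Twitter", "start": 18, "end": 25},
]


def extract_entities(text, result, min_score=0.5):
    """Slice entity strings out of `text` using the character offsets.

    `min_score` is an illustrative confidence threshold (an assumption,
    not a value recommended by the model author).
    """
    return [
        text[entity["start"]:entity["end"]]
        for entity in result
        if entity["entity_group"] == "I" and entity["score"] >= min_score
    ]


entities = extract_entities(text, result)
print(entities)  # ['Facebook', 'Ig', 'Twitter']
```

Slicing the original string rather than reading the `word` field keeps the original casing and spacing, which can matter when the tokenizer normalizes text.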

## Advanced

If your input is already word-tokenized and you want an NER label for each word, you can try the following strategy:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

ner_model_name = "igorsterner/AnE-NER"
ner_tokenizer = AutoTokenizer.from_pretrained(ner_model_name)
ner_model = AutoModelForTokenClassification.from_pretrained(ner_model_name)

word_tokens = ['My', 'Facebook', ',', 'Ig', '&', 'Twitter', 'is', 'hellaa', 'dead', 'yall', 'Jk', 'soy', 'yo', 'que', 'has', 'no', 'life', '!']

# Split the words into subwords while keeping track of the word alignment.
subword_inputs = ner_tokenizer(
    word_tokens, truncation=True, is_split_into_words=True, return_tensors="pt"
)

# For each subword position: index of its source word (None for special tokens).
subword2word = subword_inputs.word_ids(batch_index=0)

# Inference only, so gradients are not needed.
with torch.no_grad():
    logits = ner_model(**subword_inputs).logits
predictions = torch.argmax(logits, dim=2)

predicted_subword_labels = [ner_model.config.id2label[t.item()] for t in predictions[0]]
predicted_word_labels = [[] for _ in range(len(word_tokens))]

# Collect the subword labels that belong to each word.
for idx, predicted_subword in enumerate(predicted_subword_labels):
    if subword2word[idx] is not None:
        predicted_word_labels[subword2word[idx]].append(predicted_subword)

def most_frequent(lst):
    # Majority vote over a word's subword labels.
    # Words with no subwords (e.g. cut off by truncation) fall back to "O".
    return max(set(lst), key=lst.count) if lst else "O"

predicted_word_labels = [most_frequent(sublist) for sublist in predicted_word_labels]

for token, label in zip(word_tokens, predicted_word_labels):
    print(f"{token}: {label}")
```

which returns:

```
My: O
Facebook: I
,: O
Ig: I
&: O
Twitter: I
is: O
hellaa: O
dead: O
yall: O
Jk: O
soy: O
yo: O
que: O
has: O
no: O
life: O
!: O
```
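
Word-level `I`/`O` labels can be turned into entity strings by merging runs of consecutive `I` words. The following is a minimal sketch of that post-processing step; `group_entities` is a hypothetical helper, and the labels are copied from the output above rather than computed by the model.

```python
def group_entities(tokens, labels):
    """Merge runs of consecutive `I`-labelled words into entity strings."""
    spans, current = [], []
    for token, label in zip(tokens, labels):
        if label == "I":
            current.append(token)
        elif current:
            # An `O` label closes the entity that was being collected.
            spans.append(" ".join(current))
            current = []
    if current:
        # The sequence ended while still inside an entity.
        spans.append(" ".join(current))
    return spans


word_tokens = ['My', 'Facebook', ',', 'Ig', '&', 'Twitter', 'is', 'hellaa', 'dead',
               'yall', 'Jk', 'soy', 'yo', 'que', 'has', 'no', 'life', '!']
# Labels copied from the printed output above.
word_labels = ['O', 'I', 'O', 'I', 'O', 'I', 'O', 'O', 'O',
               'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

entity_spans = group_entities(word_tokens, word_labels)
print(entity_spans)  # ['Facebook', 'Ig', 'Twitter']
```

Because the label set has no separate "begin" tag, two distinct entities that appear directly next to each other would be merged into one span by this approach.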

# Word-level language labels

If you also want the language of each word, you can additionally run [AnE-LID](https://huggingface.co/igorsterner/ane-lid). Check out my evaluation scripts for examples of using both models together, as we did in the paper: [https://github.com/igorsterner/AnE/tree/main/eval](https://github.com/igorsterner/AnE/tree/main/eval).

For the above example, you can get:

```
My: English
Facebook: NE.English
,: Other
Ig: NE.English
&: Other
Twitter: NE.English
is: English
hellaa: English
dead: English
yall: English
Jk: English
soy: notEnglish
yo: notEnglish
que: notEnglish
has: English
no: English
life: English
!: Other
```
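
One plausible way to arrive at combined labels like these is to prefix the language label of every word that the NER model marks as `I` with `NE.`. The sketch below illustrates that rule only; it is an assumption and may differ from what the evaluation scripts actually do. `combine_labels` is a hypothetical helper, and the per-word language labels are illustrative values, not real AnE-LID output.

```python
def combine_labels(ner_labels, lid_labels):
    """Prefix the language label with `NE.` wherever the NER label is `I`.

    Assumed merging rule for illustration; see the evaluation scripts
    for the procedure used in the paper.
    """
    return [
        f"NE.{lid}" if ner == "I" else lid
        for ner, lid in zip(ner_labels, lid_labels)
    ]


tokens = ["My", "Facebook", ",", "Ig", "&", "Twitter", "is"]
ner_labels = ["O", "I", "O", "I", "O", "I", "O"]
# Illustrative language labels (assumed, not produced by running AnE-LID).
lid_labels = ["English", "English", "Other", "English", "Other", "English", "English"]

for token, label in zip(tokens, combine_labels(ner_labels, lid_labels)):
    print(f"{token}: {label}")
```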

# Citation

Please consider citing my work if it helped you:

```
@inproceedings{sterner-2024-multilingual,
    title = "Multilingual Identification of {E}nglish Code-Switching",
    author = "Sterner, Igor",
    editor = {Scherrer, Yves  and
      Jauhiainen, Tommi  and
      Ljube{\v{s}}i{\'c}, Nikola  and
      Zampieri, Marcos  and
      Nakov, Preslav  and
      Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.vardial-1.14",
    doi = "10.18653/v1/2024.vardial-1.14",
    pages = "163--173",
    abstract = "Code-switching research depends on fine-grained language identification. In this work, we study existing corpora used to train token-level language identification systems. We aggregate these corpora with a consistent labelling scheme and train a system to identify English code-switching in multilingual text. We show that the system identifies code-switching in unseen language pairs with absolute measure 2.3-4.6{\%} better than language-pair-specific SoTA. We also analyse the correlation between typological similarity of the languages and difficulty in recognizing code-switching.",
}
```