---

license: mit

language:

- en

datasets:

- newsmediabias/BIAS-CONLL

---
|
|
|
# UnBIAS-NER: Named Entity Recognition for Bias Detection
|
|
|
## Model Description |
|
|
|
This model is a fine-tuned token classification model that predicts an entity label for each token in a sentence.

It was fine-tuned on a custom dataset focused on identifying certain types of entities, in particular biased language in text.
|
|
|
## Intended Use |
|
|
|
The model is intended for entity recognition tasks, especially identifying biased language in text passages.

Given an input text, the model highlights the words, tokens, or **spans** it associates with a particular entity or bias.
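For illustration, the labels follow the BIO scheme used in the inference code below (`B-BIAS` opens a biased span, `I-BIAS` continues it, `O` marks everything else). The specific spans here are hypothetical, not actual model output:

```python
# Hypothetical token-level predictions for a short input; the spans are
# illustrative only.
tokens = ["all", "people", "like", "you", "are", "evil"]
labels = ["O",   "O",      "O",    "O",   "O",   "B-BIAS"]
```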
|
|
|
Related publication: https://www.sciencedirect.com/science/article/abs/pii/S0957417423020444
|
|
|
## How to Use |
|
|
|
The model can be used for inference directly through the Hugging Face `transformers` library: |
|
|
|
```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

device = torch.device("cpu")

tokenizer = AutoTokenizer.from_pretrained("newsmediabias/UnBIAS-NER")
model = AutoModelForTokenClassification.from_pretrained("newsmediabias/UnBIAS-NER")
model.to(device)
model.eval()

def highlight_biased_entities(sentence):
    # Tokenize once, keeping the input IDs and the matching token strings.
    inputs = tokenizer(sentence, return_tensors="pt").to(device)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    with torch.no_grad():
        logits = model(**inputs).logits
    predictions = torch.argmax(logits, dim=2)

    id2label = model.config.id2label

    # Reconstruct words from subword tokens and wrap biased words in BIAS[...].
    highlighted_sentence = ""
    current_word = ""
    is_biased = False
    for token, prediction in zip(tokens, predictions[0]):
        # Skip special tokens such as [CLS] and [SEP].
        if token in tokenizer.all_special_tokens:
            continue
        label = id2label[prediction.item()]
        if label in ("B-BIAS", "I-BIAS"):
            if token.startswith("##"):
                # Continuation of the current word.
                current_word += token[2:]
            else:
                # Flush the previous word before starting a new one.
                if current_word:
                    highlighted_sentence += (
                        f"BIAS[{current_word}] " if is_biased else f"{current_word} "
                    )
                current_word = token
            is_biased = True
        else:
            if token.startswith("##"):
                current_word += token[2:]
                continue
            if current_word:
                highlighted_sentence += (
                    f"BIAS[{current_word}] " if is_biased else f"{current_word} "
                )
            current_word = token
            is_biased = False

    # Flush the final word.
    if current_word:
        highlighted_sentence += f"BIAS[{current_word}]" if is_biased else current_word

    return highlighted_sentence.strip()

sentence = (
    "due to your evil and dishonest nature, i am kind of tired and want to get "
    "rid of such cheapters. all people like you are evil and a disgrace to "
    "society and I must say to get rid of immigrants as they are filthy to culture"
)
print(highlight_biased_entities(sentence))
```
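
Alternatively, the `transformers` pipeline API can aggregate subword tokens into word-level entities for you. A minimal sketch (`aggregation_strategy="simple"` is a standard pipeline option; the exact entity groups returned depend on the model's label set):

```python
from transformers import pipeline

# Token-classification pipeline that merges subword pieces into whole words.
ner = pipeline(
    "token-classification",
    model="newsmediabias/UnBIAS-NER",
    aggregation_strategy="simple",
)

for entity in ner("all people like you are evil and a disgrace to society"):
    print(entity["word"], entity["entity_group"], round(entity["score"], 3))
```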
|
|
|
|
|
## Limitations and Biases |
|
|
|
Every model has limitations, and it's crucial to understand these when deploying models in real-world scenarios: |
|
|
|
1. **Training Data**: The model is trained on a specific dataset, and its predictions are only as good as the data it's trained on. |
|
2. **Generalization**: While the model may perform well on certain types of sentences or phrases, it might not generalize well to all types of text or contexts. |
|
|
|
It's also essential to be aware of any potential biases in the training data, which might affect the model's predictions. |
|
|
|
## Training Data |
|
|
|
The model was fine-tuned on a custom dataset (`newsmediabias/BIAS-CONLL`, listed in the metadata above). Contact **Shaina Raza** ([email protected]) for access to the dataset.