My result has unnecessarily split words from my input along with some extra characters.
The problem is that when I input a sentence into the model, the result I receive does not contain the same words as my sentence. Often a word is split in two, with an extra "##" in front of the second half of the word.
for example
Input:
Modi increased the education budget by 20
Output:
For the very first word, "Modi", the output was split into "Mod" and "##i", with that unexpected "##". I am entering exactly the sentence given above, and I cannot understand why I am receiving this result. My guess is that the issue is most likely not on my end, but I am open to suggestions if I am doing something wrong.
Hello,
Modern language models use subwords to represent words that are not directly in their vocabulary. In BERT's case, this is WordPiece tokenization (https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt). From a practical point of view, you can either pick the first subtoken of each tokenized word (so "Mod" for "Modi"), or do some form of majority voting over all of its subtokens. The subtokens always have the form: word -> subword1 ##subword2 ##subword3.
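To make the "##" convention concrete, here is a minimal, self-contained sketch of WordPiece-style greedy longest-match tokenization and of merging the subtokens back into words. The toy vocabulary and the helper functions are hypothetical illustrations, not BERT's actual implementation (a real BERT vocabulary has roughly 30k entries):

```python
# Hypothetical toy vocabulary; "Modi" is absent, so it must be split.
TOY_VOCAB = {"Mod", "##i", "increased", "the", "education", "budget", "by", "20"}

def wordpiece(word, vocab):
    """Greedy longest-match split of one word into subtokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            # Continuation pieces (not at the start of the word) carry "##".
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no vocabulary entry covers this span
        pieces.append(match)
        start = end
    return pieces

def merge_subtokens(tokens):
    """Rejoin '##'-prefixed continuation pieces onto the preceding token."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words

sentence = "Modi increased the education budget by 20"
tokens = [p for w in sentence.split() for p in wordpiece(w, TOY_VOCAB)]
print(tokens)                   # ['Mod', '##i', 'increased', ...]
print(merge_subtokens(tokens))  # ['Modi', 'increased', ...]
```

With a real tokenizer you would not merge strings by hand like this; the point is only that the "##" prefix marks a continuation piece, so it is always safe to glue such pieces back onto the preceding token, or to keep just the first subtoken per word when aligning per-token predictions.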
Hope this helps!