My result has unnecessarily split words from my input along with some extra characters.
The problem is that when I input a sentence into the model, the result I receive does not contain the same words as my sentence. Often a word is split in two, with an extra "##" in front of the second half of the word.
for example
Input:
Modi increased the education budget by 20
Output:
For the very first word, "Modi", the output was split into "Mod" and "##i", with that unexpected "##". I am entering exactly the sentence given above, and I cannot understand why I am receiving this result. My guess is that the issue is most likely not on my end, but I am open to suggestions if I am doing something wrong.
Hello,
Modern language models use subwords to represent words that are not directly in their vocabulary. In BERT's case, this is WordPiece tokenization (https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt). From a practical point of view, you can either pick the first subtoken of each tokenized word (so "Mod" for "Modi"), or do some form of majority voting over all of its subtokens. The subtokens always have the form: word -> subword1 ##subword2 ##subword3.
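To make the "##" convention concrete, here is a minimal, self-contained sketch of WordPiece-style greedy longest-match tokenization and of merging the subtokens back into words. The toy vocabulary and the helper functions are hypothetical illustrations, not BERT's actual implementation (a real BERT vocabulary has roughly 30k entries):

```python
# Hypothetical toy vocabulary; "Modi" is absent, so it must be split.
TOY_VOCAB = {"Mod", "##i", "increased", "the", "education", "budget", "by", "20"}

def wordpiece(word, vocab):
    """Greedy longest-match split of one word into subtokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            # Continuation pieces (not at the start of the word) carry "##".
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no vocabulary entry covers this span
        pieces.append(match)
        start = end
    return pieces

def merge_subtokens(tokens):
    """Rejoin '##'-prefixed continuation pieces onto the preceding token."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words

sentence = "Modi increased the education budget by 20"
tokens = [p for w in sentence.split() for p in wordpiece(w, TOY_VOCAB)]
print(tokens)                   # ['Mod', '##i', 'increased', ...]
print(merge_subtokens(tokens))  # ['Modi', 'increased', ...]
```

With a real tokenizer you would not merge strings by hand like this; the point is only that the "##" prefix marks a continuation piece, so it is always safe to glue such pieces back onto the preceding token, or to keep just the first subtoken per word when aligning per-token predictions.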
Hope this helps!