import streamlit as st

# TODO: move to 'utils'
mystyle = '''
    <style>
        p {
            text-align: justify;
        }
    </style>
    '''
st.markdown(mystyle, unsafe_allow_html=True)


def divider():
    _, c, _ = st.columns(3)
    c.divider()

st.title("Transformers: Tokenisers and Embeddings")

preface_image, preface_text = st.columns(2)
# preface_image.image("https://static.streamlit.io/examples/dice.jpg")
# preface_image.image("""https://assets.digitalocean.com/articles/alligator/boo.svg""")
preface_text.write("""*Transformers represent a revolutionary class of machine learning architectures that have sparked 
immense interest. While numerous insightful tutorials are available, the evolution of transformer architectures over 
the last few years has led to significant simplifications. These advancements have made it increasingly 
straightforward to understand their inner workings. In this series of articles, I aim to provide a direct, clear explanation of 
how and why modern transformers function, unburdened by the historical complexities associated with their inception.*
""")

divider()

st.write("""In order to understand the recent success in AI we need to understand the Transformer architecture. Its 
rise in the field of Natural Language Processing (NLP) is largely attributed to a combination of several key 
advancements:

- Tokenisers and Embeddings 
- Attention and Self-Attention
- Encoder-Decoder architecture

Understanding these foundational concepts is crucial to comprehending the overall structure and function of the 
Transformer model. They are the building blocks from which the rest of the model is constructed, and their roles 
within the architecture are essential to the model's ability to process and generate language.

Given the importance and complexity of these concepts, I have chosen to dedicate the first article in this series 
solely to Tokenisation and embeddings. The decision to separate the topics into individual articles is driven by a 
desire to provide a thorough and in-depth understanding of each component of the Transformer model.


""")

with st.expander("Copernicus Museum in Warsaw"):
    st.write(""" 
Have you ever visited the Copernicus Museum in Warsaw? It's an engaging interactive hub that allows 
you to familiarize yourself with various scientific topics. The experience is both entertaining and educational, 
providing the opportunity to explore different concepts firsthand. **They even feature a small neural network that 
illustrates the neuron activation process during the recognition of handwritten digits!**

Taking inspiration from this approach, we'll embark on our journey into the world of Transformer models by first 
establishing a firm understanding of Tokenisation and embeddings. This foundation will equip us with the knowledge 
needed to delve into the more complex aspects of these models later on.

I encourage you not to hesitate in modifying parameters or experimenting with different models in the provided 
examples. This hands-on exploration can significantly enhance your learning experience. So, let's begin our journey 
through this virtual, interactive museum of AI. Enjoy the exploration!
""")
    st.image("https://i.pinimg.com/originals/04/11/2c/04112c791a859d07a01001ac4f436e59.jpg")

divider()

st.header("Tokenisers and Tokenisation")

st.write("""Tokenisation is the initial step in the data preprocessing pipeline for natural language processing (NLP) 
models. It involves breaking down a piece of text—whether a sentence, paragraph, or document—into smaller units, 
known as "tokens". In English and many other languages, a token often corresponds to a word, but it can also be a 
subword, character, or n-gram. The choice of token size depends on various factors, including the task at hand and 
the language of the text.
""")

from transformers import AutoTokenizer

sentence = st.text_input("Sentence to explore (you can change it):", value="Tokenising text is a fundamental step for NLP models.")
sentence_split = sentence.split()
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
sentence_tokenise_bert = tokenizer.tokenize(sentence)
# encode without the automatically added [CLS]/[SEP] so the ids line up with the tokens above
sentence_encode_bert = tokenizer.encode(sentence, add_special_tokens=False)
sentence_encode_bert = list(zip(sentence_tokenise_bert, sentence_encode_bert))

st.write(f"""
Consider the sentence:
""")
st.code(f"""
"{sentence}"
""")

st.write(f"""
A basic word-level Tokenisation would produce tokens:
""")
st.code(f"""
{sentence_split}
""")


st.write(f"""
However, a more sophisticated algorithm, with several optimizations, might generate a different set of tokens: 
""")
st.code(f"""
{sentence_tokenise_bert}
""")

with st.expander("click to look at the code:"):
    st.code(f"""\
from transformers import AutoTokenizer

sentence = st.text_input("Sentence to explore (you can change it):", value="{sentence}")
sentence_split = sentence.split()
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
sentence_tokenise_bert = tokenizer.tokenize(sentence)
sentence_encode_bert = tokenizer.encode(sentence, add_special_tokens=False)
sentence_encode_bert = list(zip(sentence_tokenise_bert, sentence_encode_bert))
    """, language='python')


st.write("""
As machine learning models, including Transformers, work with numbers rather than words, each vocabulary 
entry is assigned a corresponding numerical value. Here is a key-value, vocabulary-based representation of 
the input, pairing each token with its so-called 'token id':
"""
)

st.code(f"""
{sentence_encode_bert}
""")


st.write("""
What distinguishes subword Tokenisation is its reliance on statistical rules and algorithms, learned from 
the pretraining corpus. The resulting Tokeniser creates a vocabulary, which usually represents the most frequently 
used words and subwords. For example, Byte Pair Encoding (BPE) first encodes the most frequent words as single 
tokens, while less frequent words are represented by multiple tokens, each representing a word part.

There are numerous different Tokenisers available, including spaCy, Moses, Byte-Pair Encoding (BPE), 
Byte-level BPE, WordPiece, Unigram, and SentencePiece. It's crucial to choose a specific Tokeniser and stick with it. 
Changing the Tokeniser is akin to altering the model's language on the fly—imagine studying physics in English and 
then taking the exam in French or Spanish. You might get lucky, but it's a considerable risk.
""")

with st.expander("""Let's train a tokeniser using our own dataset"""):
    training_dataset = """\
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
"""
    training_dataset = st.text_area("*Training Dataset - Vocabulary:*", value=training_dataset, height=200)
    training_dataset = training_dataset.split('\n')
    vocabulary_size = st.number_input("Vocabulary Size:", value=100000)


    # TODO: add more tokenisers
    from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    # tokenizer = Tokenizer(models.Unigram())
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    tokenizer.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size=vocabulary_size)

    # trainer = trainers.UnigramTrainer(
    #     vocab_size=20000,
    #     initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    #     special_tokens=["<PAD>", "<BOS>", "<EOS>"],
    # )

    tokenizer.train_from_iterator(training_dataset, trainer=trainer)

    sentence = st.text_input("*Text to tokenise:*", value="[CLS]  Tokenising text is a fundamental step for NLP models. [SEP] [PAD] [PAD] [PAD]")
    output = tokenizer.encode(sentence)

    st.write("*Tokens:*")
    st.code(f"""{output.tokens}""")
    st.code(f"""\
    ids: {output.ids}
    attention_mast: {output.attention_mask}
    """)



    st.subheader("Try Yourself:")
    st.write(f""" *Aim to find or create a comprehensive vocabulary (training dataset) for Tokenisation, which can enhance the 
        efficiency of the process. This approach helps to eliminate unknown tokens, thereby making the token sequence 
        more understandable and containing less tokens* 
    """)

    st.caption("Special tokens meaning:")
    st.write("""
\\#\\# prefix: indicates that the token continues the previous one with no whitespace in between; any token with this 
prefix should be merged with the previous token when converting the tokens back to a string.

[UNK]: Stands for "unknown". This token is used to represent any word that is not in the model's vocabulary. Since 
most models have a fixed-size vocabulary, it's not possible to have a unique token for every possible word. The [UNK] 
token is used as a catch-all for anything the model hasn't seen before. E.g. with the small vocabulary trained above, 
characters that never appeared in the training text (and any token containing them) are mapped to [UNK].
 
[CLS]: Stands for "classification". In models like BERT, this token is added at the beginning of every input 
sequence. The representation (embedding) of this token is used as the aggregate sequence representation for 
classification tasks. In other words, the model is trained to encode the meaning of the entire sequence into this token.

[SEP]: Stands for "separator". This token is used to separate different sequences when the model needs to take more 
than one input sequence. For example, in question-answering tasks, the model takes two inputs: a question and a 
passage that contains the answer. The two inputs are separated by a [SEP] token.

[MASK]: This token is specific to models like BERT, which are trained with a masked language modelling objective. 
During training, some percentage of the input tokens are replaced with the [MASK] token, and the model's goal is to 
predict the original value of the masked tokens.
 
[PAD]: Stands for "padding". This token is used to fill in the extra spaces when batching sequences of different 
lengths together. Since models require input sequences within a batch to be the same length, shorter sequences are 
extended with [PAD] tokens. In our example, the input sequence ends with a few trailing [PAD] tokens.

    """)
    st.caption("Python code:")
    st.code(f"""
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size={vocabulary_size})
training_dataset = {training_dataset}
tokenizer.train_from_iterator(training_dataset, trainer=trainer)
output = tokenizer.encode("{sentence}")
            """, language='python')


with st.expander("References:"):
    st.write("""\
- https://huggingface.co/docs/transformers/tokenizer_summary
- https://huggingface.co/docs/tokenizers/training_from_memory
- https://en.wikipedia.org/wiki/Byte_pair_encoding
    
    """)

divider()
st.header("Embeddings")
st.caption("TBD...")