### Tokenizer

The tokenizer is a completely separate module from the LLM. It has its own training dataset of text, on which its vocabulary is trained using the BPE (byte pair encoding) algorithm. Once trained, it translates back and forth between raw text and a sequence of integers (tokens). The LLM only ever sees the tokens and never deals with the text directly.

![image.png](../public/tokenizer.png)
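
As a rough sketch of this round trip, the simplest possible tokenizer maps each UTF-8 byte to one token and has no trained vocabulary at all. The names `toy_encode` and `toy_decode` are hypothetical and exist only for this illustration; a BPE tokenizer adds learned merges on top of this idea.

```python
def toy_encode(text):
    # text -> list of integers (one token per UTF-8 byte)
    return list(text.encode("utf-8"))

def toy_decode(ids):
    # list of integers -> text
    return bytes(ids).decode("utf-8")

toy_ids = toy_encode("hi")
print(toy_ids)              # [104, 105]
print(toy_decode(toy_ids))  # hi
```

The model would only ever receive `toy_ids`; the text is recovered by the tokenizer afterwards.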

In [8]:
# the unicode code point of the character
ord('a')

97

In [9]:
tokens = [ord(c) for c in "మీరు ఎలా ఉన్నారు? (How are you?)"]
tokens

[3118,
 3136,
 3120,
 3137,
 32,
 3086,
 3122,
 3134,
 32,
 3081,
 3112,
 3149,
 3112,
 3134,
 3120,
 3137,
 63,
 32,
 40,
 72,
 111,
 119,
 32,
 97,
 114,
 101,
 32,
 121,
 111,
 117,
 63,
 41]

However, using one token per character makes the sequences long, which increases the computational cost of both training the model and generating text. To shorten them, the [GPT2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) uses the byte pair encoding (BPE) algorithm, applied to bytes rather than characters.

### Byte pair encoding algorithm

Consider the string:

`aaabdaaabac`

The byte pair "aa" occurs most often in the string, so we replace it with a new byte that is not yet used in the `vocab`, say "Z".
The string then becomes:

```
ZabdZabac
Z = aa
```

The process is then repeated: at each step the most frequent pair of adjacent bytes is replaced with a new, unused byte, until the string/data cannot be compressed further.

Next, the byte pair "ab" is replaced with "Y":
```
ZYdZYac
Y=ab
Z=aa
```
Finally, "ZY" is replaced with "X":
```
XdXac
X=ZY
Y=ab
Z=aa
```




In [10]:
text = """Autogen enables the next-gen LLM applications with a generic [multi-agent conversation](https://microsoft.github.io/autogen/docs/Use-Cases/agent_chat) framework. It offers customizable and conversable agents that integrate LLMs, tools, and humans.
By automating chat among multiple capable agents, one can easily make them collectively perform tasks autonomously or with human feedback, including tasks that require using tools via code.

Features of this use case include:

- **Multi-agent conversations**: AutoGen agents can communicate with each other to solve tasks. This allows for more complex and sophisticated applications than would be possible with a single LLM.
- **Customization**: AutoGen agents can be customized to meet the specific needs of an application. This includes the ability to choose the LLMs to use, the types of human input to allow, and the tools to employ.
- **Human participation**: AutoGen seamlessly allows human participation. This means that humans can provide input and feedback to the agents as needed.

For [example](https://github.com/microsoft/autogen/blob/main/test/twoagent.py),

```python
from autogen import AssistantAgent, UserProxyAgent, config_list_from_json
# Load LLM inference endpoints from an env variable or a file
# See https://microsoft.github.io/autogen/docs/FAQ#set-your-api-endpoints
# and OAI_CONFIG_LIST_sample
config_list = config_list_from_json(env_or_file="OAI_CONFIG_LIST")
# You can also set config_list directly as a list, for example, config_list = [{'model': 'gpt-4', 'api_key': ''},]
assistant = AssistantAgent("assistant", llm_config={"config_list": config_list})
user_proxy = UserProxyAgent("user_proxy", code_execution_config={"work_dir": "coding", "use_docker": False}) # IMPORTANT: set to True to run code in docker, recommended
user_proxy.initiate_chat(assistant, message="Plot a chart of NVDA and TESLA stock price change YTD.")
# This initiates an automated chat between the two agents to solve the task
```

more python code:

```python
 def create(
 self,
 *,
 messages: Iterable[ChatCompletionMessageParam],
 model: Union[str, ChatModel],
 frequency_penalty: Optional[float] | NotGiven = NOT_GIVEN,
 function_call: completion_create_params.FunctionCall | NotGiven = NOT_GIVEN,
 functions: Iterable[completion_create_params.Function] | NotGiven = NOT_GIVEN,
 logit_bias: Optional[Dict[str, int]] | NotGiven = NOT_GIVEN,
 logprobs: Optional[bool] | NotGiven = NOT_GIVEN,
 max_tokens: Optional[int] | NotGiven = NOT_GIVEN,
 n: Optional[int] | NotGiven = NOT_GIVEN,
 presence_penalty: Optional[float] | NotGiven = NOT_GIVEN,
 response_format: completion_create_params.ResponseFormat | NotGiven = NOT_GIVEN,
 seed: Optional[int] | NotGiven = NOT_GIVEN,
 stop: Union[Optional[str], List[str]] | NotGiven = NOT_GIVEN,
 stream: Optional[Literal[False]] | Literal[True] | NotGiven = NOT_GIVEN,
 stream_options: Optional[ChatCompletionStreamOptionsParam] | NotGiven = NOT_GIVEN,
 temperature: Optional[float] | NotGiven = NOT_GIVEN,
 tool_choice: ChatCompletionToolChoiceOptionParam | NotGiven = NOT_GIVEN,
 tools: Iterable[ChatCompletionToolParam] | NotGiven = NOT_GIVEN,
 top_logprobs: Optional[int] | NotGiven = NOT_GIVEN,
 top_p: Optional[float] | NotGiven = NOT_GIVEN,
 user: str | NotGiven = NOT_GIVEN,
 # Use the following arguments if you need to pass additional parameters to the API that aren't available via kwargs.
 # The extra values given here take precedence over values defined on the client or passed to this method.
 extra_headers: Headers | None = None,
 extra_query: Query | None = None,
 extra_body: Body | None = None,
 timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
 ) -> ChatCompletion | Stream[ChatCompletionChunk]:
 return self._post(
 "/chat/completions",
 body=maybe_transform(
 {
 "messages": messages,
 "model": model,
 "frequency_penalty": frequency_penalty,
 "function_call": function_call,
 "functions": functions,
 "logit_bias": logit_bias,
 "logprobs": logprobs,
 "max_tokens": max_tokens,
 "n": n,
 "presence_penalty": presence_penalty,
 "response_format": response_format,
 "seed": seed,
 "stop": stop,
 "stream": stream,
 "stream_options": stream_options,
 "temperature": temperature,
 "tool_choice": tool_choice,
 "tools": tools,
 "top_logprobs": top_logprobs,
 "top_p": top_p,
 "user": user,
 },
 completion_create_params.CompletionCreateParams,
 ),
 options=make_request_options(
 extra_headers=extra_headers, extra_query=extra_query, extra_body=extra_body, timeout=timeout
 ),
 cast_to=ChatCompletion,
 stream=stream or False,
 stream_cls=Stream[ChatCompletionChunk],
 )
```
"""
tokens = text.encode('utf-8') # encode the text into raw UTF-8 bytes
tokens = list(map(int, tokens)) # convert the bytes to a list of integers in the range 0..255
print('----')
print(text)
print('length:', len(text))
print('----')
print(tokens)
print('length:', len(tokens))


----
Autogen enables the next-gen LLM applications with a generic [multi-agent conversation](https://microsoft.github.io/autogen/docs/Use-Cases/agent_chat) framework. It offers customizable and conversable agents that integrate LLMs, tools, and humans.
By automating chat among multiple capable agents, one can easily make them collectively perform tasks autonomously or with human feedback, including tasks that require using tools via code.

Features of this use case include:

- **Multi-agent conversations**: AutoGen agents can communicate with each other to solve tasks. This allows for more complex and sophisticated applications than would be possible with a single LLM.
- **Customization**: AutoGen agents can be customized to meet the specific needs of an application. This includes the ability to choose the LLMs to use, the types of human input to allow, and the tools to employ.
- **Human participation**: AutoGen seamlessly allows human participation. This means that humans can provide 
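
For ASCII text the number of UTF-8 bytes equals the number of characters, but that does not hold in general. A small sketch, reusing part of the Telugu string from earlier (the variable names are illustrative):

```python
ascii_text = "hello"
telugu_text = "మీరు"

for s in (ascii_text, telugu_text):
    byte_ids = list(s.encode("utf-8"))
    print(len(s), "characters ->", len(byte_ids), "bytes")

# 5 characters -> 5 bytes
# 4 characters -> 12 bytes
```

Every byte value is below 256, which is why the base vocabulary used later has exactly 256 entries.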

In [11]:
def get_stats(ids):
    """
    Count how often each pair of consecutive token ids occurs.
    Returns a dict mapping (id, next_id) -> count.
    """
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

stats = get_stats(tokens)
# print(stats)
# list the pairs as (count, pair), most frequent first
print(sorted(((v,k) for k,v in stats.items()), reverse=True))

[(770, (32, 32)), (86, (111, 110)), (73, (101, 110)), (66, (10, 32)), (65, (116, 105)), (57, (44, 10)), (56, (105, 111)), (55, (58, 32)), (55, (32, 116)), (52, (97, 116)), (50, (116, 111)), (50, (101, 32)), (48, (32, 78)), (47, (110, 32)), (44, (114, 101)), (40, (115, 116)), (40, (32, 97)), (38, (115, 32)), (38, (101, 114)), (36, (115, 101)), (35, (97, 108)), (35, (32, 99)), (34, (108, 101)), (32, (116, 104)), (31, (114, 97)), (30, (97, 110)), (29, (110, 116)), (28, (118, 101)), (28, (116, 114)), (28, (111, 109)), (28, (97, 109)), (27, (124, 32)), (27, (103, 101)), (27, (101, 97)), (27, (99, 111)), (27, (78, 111)), (27, (61, 32)), (27, (32, 124)), (27, (32, 61)), (26, (116, 32)), (26, (101, 115)), (25, (105, 110)), (24, (110, 115)), (24, (34, 58)), (24, (32, 34)), (23, (116, 101)), (23, (112, 108)), (23, (109, 112)), (22, (111, 116)), (22, (105, 118)), (22, (101, 116)), (22, (44, 32)), (22, (32, 115)), (21, (112, 116)), (21, (110, 97)), (21, (108, 111)), (21, (105, 115)), (21, (104, 97
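
The same pair counts can also be obtained with `collections.Counter` from the standard library. This is an alternative sketch, not the version used in the rest of this notebook; `get_stats_counter` is a hypothetical name.

```python
from collections import Counter

def get_stats_counter(ids):
    # Count every pair of adjacent ids; Counter is a dict subclass.
    return Counter(zip(ids, ids[1:]))

example = [1, 2, 3, 1, 2]
example_stats = get_stats_counter(example)
print(example_stats.most_common(1))  # [((1, 2), 2)]
```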

In [12]:
chr(32), chr(32) # the most common pair (32, 32) is two consecutive spaces

(' ', ' ')

In [13]:
def merge(ids, pair, idx):
    """
    Replace every occurrence of `pair` in `ids` with the new token `idx`.
    ids: list of integers (tokens)
    pair: tuple of two consecutive integers
    idx: new vocab token that replaces the pair
    """
    new_ids = []
    i = 0
    while i < len(ids):
        # match the pair at position i, unless i is the last element
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
            new_ids.append(idx)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

# merge the most common pair
tokens2 = merge(tokens, (32, 32), 1000)
print(tokens2)
print('length: ',len(tokens2))

[65, 117, 116, 111, 103, 101, 110, 32, 101, 110, 97, 98, 108, 101, 115, 32, 116, 104, 101, 32, 110, 101, 120, 116, 45, 103, 101, 110, 32, 76, 76, 77, 32, 97, 112, 112, 108, 105, 99, 97, 116, 105, 111, 110, 115, 32, 119, 105, 116, 104, 32, 97, 32, 103, 101, 110, 101, 114, 105, 99, 32, 91, 109, 117, 108, 116, 105, 45, 97, 103, 101, 110, 116, 32, 99, 111, 110, 118, 101, 114, 115, 97, 116, 105, 111, 110, 93, 40, 104, 116, 116, 112, 115, 58, 47, 47, 109, 105, 99, 114, 111, 115, 111, 102, 116, 46, 103, 105, 116, 104, 117, 98, 46, 105, 111, 47, 97, 117, 116, 111, 103, 101, 110, 47, 100, 111, 99, 115, 47, 85, 115, 101, 45, 67, 97, 115, 101, 115, 47, 97, 103, 101, 110, 116, 95, 99, 104, 97, 116, 41, 32, 102, 114, 97, 109, 101, 119, 111, 114, 107, 46, 32, 73, 116, 32, 111, 102, 102, 101, 114, 115, 32, 99, 117, 115, 116, 111, 109, 105, 122, 97, 98, 108, 101, 32, 97, 110, 100, 32, 99, 111, 110, 118, 101, 114, 115, 97, 98, 108, 101, 32, 97, 103, 101, 110, 116, 115, 32, 116, 104, 97, 116, 32, 105, 1
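
On a short list the effect of `merge` is easier to see. The function is repeated here only so that the sketch runs on its own; the input lists are made up for illustration.

```python
def merge(ids, pair, idx):
    # same logic as the cell above, repeated so the example is self-contained
    new_ids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            new_ids.append(idx)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

print(merge([5, 6, 6, 7, 9, 1], (6, 7), 99))  # [5, 6, 99, 9, 1]
print(merge([1, 1, 1], (1, 1), 9))            # [9, 1]
```

The second call shows that matches are consumed left to right without overlapping: three equal ids yield one merged token plus one leftover.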

In [14]:
# complete cycle
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

# repeatedly merge the most common pair to build a new vocab
vocab_size = 296
num_merges = vocab_size - 256 # the first 256 ids are the raw byte values
ids = list(tokens) # copy, so the original byte tokens stay untouched

merges = {} # (token, token) -> new token id, in the order the merges were learned
for i in range(num_merges):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get) # get the most common pair
    idx = 256 + i # new vocab token
    print(f'merge {pair} to {idx}')
    ids = merge(ids, pair, idx)
    merges[pair] = idx


merge (32, 32) to 256
merge (256, 256) to 257
merge (257, 257) to 258
merge (111, 110) to 259
merge (101, 110) to 260
merge (116, 105) to 261
merge (10, 258) to 262
merge (58, 32) to 263
merge (44, 262) to 264
merge (261, 259) to 265
merge (101, 32) to 266
merge (116, 111) to 267
merge (32, 78) to 268
merge (97, 116) to 269
merge (115, 32) to 270
merge (101, 114) to 271
merge (114, 101) to 272
merge (97, 108) to 273
merge (116, 104) to 274
merge (115, 116) to 275
merge (97, 110) to 276
merge (260, 32) to 277
merge (97, 109) to 278
merge (108, 101) to 279
merge (32, 124) to 280
merge (105, 110) to 281
merge (34, 263) to 282
merge (111, 109) to 283
merge (61, 268) to 284
merge (44, 32) to 285
merge (280, 268) to 286
merge (257, 34) to 287
merge (264, 258) to 288
merge (115, 101) to 289
merge (108, 111) to 290
merge (84, 95) to 291
merge (105, 118) to 292
merge (292, 277) to 293
merge (112, 265) to 294
merge (111, 116) to 295


In [15]:
print("tokens length: ", len(tokens))
print("new tokens length: ", len(ids))
print(f"compression rate: {len(tokens) / len(ids):.2f}X")

tokens length: 5397
new tokens length: 3365
compression rate: 1.60X
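
The same training loop can be checked by hand on the toy string `aaabdaaabac`. The sketch below wraps the loop in a hypothetical helper `train_bpe` (not part of the cells above) and runs three merges. Ties are broken by whichever pair was seen first, so the second merge picks `(256, 97)` ("Za") rather than "ab" as in the walkthrough; the final length is the same.

```python
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    new_ids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            new_ids.append(idx)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

def train_bpe(ids, num_merges):
    # hypothetical helper: the merge loop from above, wrapped in a function
    ids = list(ids)
    merges = {}
    for i in range(num_merges):
        stats = get_stats(ids)
        if not stats:  # fewer than two tokens left, nothing to merge
            break
        pair = max(stats, key=stats.get)
        idx = 256 + i
        ids = merge(ids, pair, idx)
        merges[pair] = idx
    return ids, merges

raw = list("aaabdaaabac".encode("utf-8"))
toy_ids, toy_merges = train_bpe(raw, 3)
print(toy_merges)  # {(97, 97): 256, (256, 97): 257, (257, 98): 258}
print(toy_ids)     # [258, 100, 258, 97, 99]
print(f"compression rate: {len(raw) / len(toy_ids):.2f}X")  # 2.20X
```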


#### decoding

Given a sequence of integers in the range [0, vocab_size), convert it back into a string.

In [16]:
vocab = {idx: bytes([idx]) for idx in range(256)} # base vocab: one entry per byte value
# relies on `merges` keeping insertion order, so p0 and p1 are already in vocab
for (p0, p1), idx in merges.items():
    vocab[idx] = vocab[p0] + vocab[p1] # add the merged tokens (256 to 295)

def decode(ids):
    bytetokens = b"".join(vocab[i] for i in ids)
    # invalid UTF-8 sequences are replaced with U+FFFD (the replacement character)
    text = bytetokens.decode("utf-8", errors="replace")
    return text

print('---')
print(decode(ids))
print('length: ', len(decode(ids)))

---
Autogen enables the next-gen LLM applications with a generic [multi-agent conversation](https://microsoft.github.io/autogen/docs/Use-Cases/agent_chat) framework. It offers customizable and conversable agents that integrate LLMs, tools, and humans.
By automating chat among multiple capable agents, one can easily make them collectively perform tasks autonomously or with human feedback, including tasks that require using tools via code.

Features of this use case include:

- **Multi-agent conversations**: AutoGen agents can communicate with each other to solve tasks. This allows for more complex and sophisticated applications than would be possible with a single LLM.
- **Customization**: AutoGen agents can be customized to meet the specific needs of an application. This includes the ability to choose the LLMs to use, the types of human input to allow, and the tools to employ.
- **Human participation**: AutoGen seamlessly allows human participation. This means that humans can provide i
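
Not every token sequence maps to valid UTF-8, which is why `decode` passes `errors="replace"`. A small sketch of the difference, using a single byte that cannot stand alone in UTF-8:

```python
invalid = bytes([128])  # a lone continuation byte is not valid UTF-8

replaced = invalid.decode("utf-8", errors="replace")
print(replaced == "\ufffd")  # True: U+FFFD, the Unicode replacement character

try:
    invalid.decode("utf-8")  # the default is errors="strict"
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)
```

With the default strict mode, decoding a sequence such as `[128]` would raise an exception instead of returning text.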

#### encoding
Convert a string into a sequence of tokens.

In [17]:
def encode(texts):
    tokens = list(texts.encode('utf-8'))
    while len(tokens) >= 2:
        stats = get_stats(tokens)
        # select the pair that was learned earliest (lowest merge index);
        # pairs that were never merged get priority infinity
        pair = min(stats, key=lambda p: merges.get(p, float('inf')))
        if pair not in merges:
            break # nothing left to merge
        idx = merges[pair]
        tokens = merge(tokens, pair, idx)
    return tokens

print(encode("hk"))

[104, 107]


The following line ensures that the algorithm respects the merge priorities: pairs are merged in the same order in which they were learned.
```
pair = min(stats, key=lambda p: merges.get(p, float('inf')))
```
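
Why the order matters can be seen with a tiny, made-up merge table. In this sketch, `encode_with` is a hypothetical variant of `encode` that takes the merge table as an argument, and `toy_merges` is assumed for illustration: the pair `(256, 97)` can only match after `(97, 97)` has been merged into 256.

```python
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    new_ids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            new_ids.append(idx)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

def encode_with(text, merges):
    # hypothetical variant of encode() with the merge table passed in
    tokens = list(text.encode("utf-8"))
    while len(tokens) >= 2:
        stats = get_stats(tokens)
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break
        tokens = merge(tokens, pair, merges[pair])
    return tokens

toy_merges = {(97, 97): 256, (256, 97): 257, (257, 98): 258}
print(encode_with("aaab", toy_merges))  # [258]
print(encode_with("aab", toy_merges))   # [256, 98]
print(encode_with("ab", toy_merges))    # [97, 98]
```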

In [18]:
print(decode(encode(" presence_penalty ")))

 presence_penalty 
