Title: Best Practice for Handling Variable-Length Sequences When Training an LLM on a Chatbot Dataset
I am currently training Falcon (an LLM) on a chatbot dataset and would appreciate some guidance on handling variable-length sequences. The dataset consists of around 500 examples of chat messages exchanged between user 1 and user 2. Each example contains a different number of messages, so the sequence lengths vary.
Here are two representative data points from the dataset:
```python
datapoint_1 = """user 1 : How are you ?\n user 2 : I am good. \n user 1 : What do you like ? \n user 2 : Apples"""

datapoint_2 = """user 1 : How are you ?\nuser 2 : I am good.\n user 1 : What do you like in fruits?\n user 2 : Oranges \nuser 1 : Great me too\n user 2 : But sometimes I like mangoes \nuser 1 : seems intresting \n user 2 : Yeah"""
```
To facilitate training, I tokenized the dataset with the `max_length` of the `input_ids` set to 4 tokens, and handled the overflowing tokens by padding them accordingly.
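For reference, here is roughly what my tokenization step looks like. This is a minimal sketch assuming a Hugging Face `transformers` fast tokenizer; the `tiiuae/falcon-7b` checkpoint name is just an assumption, and `max_length=4` mirrors the toy value above:

```python
from transformers import AutoTokenizer

# Assumption: any Falcon checkpoint with a fast tokenizer; the name is illustrative.
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
# Falcon's tokenizer ships without a pad token, so reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token

datapoint_1 = "user 1 : How are you ?\nuser 2 : I am good.\nuser 1 : What do you like ?\nuser 2 : Apples"

encoded = tokenizer(
    datapoint_1,
    max_length=4,                    # toy value from above; real runs would use e.g. 512+
    truncation=True,
    return_overflowing_tokens=True,  # keep the chunks that spill past max_length
    padding="max_length",            # pad the final, shorter chunk to max_length
)
# encoded["input_ids"] is a list of 4-token chunks; only the last chunk needs padding.
print(encoded["input_ids"])
```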
Now, my question is: when a chat message contains fewer than 4 tokens, what is considered best practice? Should I pad these shorter sequences to the maximum length, or would it be better to keep them as they are?
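To make the two options I am weighing concrete, here is a sketch of both, again assuming the same tokenizer; `DataCollatorWithPadding` is the standard `transformers` helper for dynamic per-batch padding:

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")  # illustrative checkpoint
tokenizer.pad_token = tokenizer.eos_token

messages = ["user 2 : Apples", "user 1 : What do you like ?"]

# Option A: pad every short sequence up front to the fixed max_length.
fixed = tokenizer(messages, max_length=4, truncation=True, padding="max_length")

# Option B: tokenize without padding and let a collator pad each batch
# dynamically, only up to the longest sequence in that batch.
features = [tokenizer(m, max_length=4, truncation=True) for m in messages]
collator = DataCollatorWithPadding(tokenizer=tokenizer)
batch = collator(features)  # returns tensors padded per batch
```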
I would appreciate any insights or suggestions on the most appropriate approach for handling variable-length sequences in this context.