Obvious problem in the tokenization

#3 opened by ParisNeo

Hello,

I downloaded Lucie onto my server and I'm running it with ollama using the GGUF.
I have the impression that all tokens (except a few) have an extra space. This is not uncommon in AI, as it saves on space characters, but it is catastrophic when you want to write Python code.

Lucie systematically adds a space before every Python line, except lines with comments, because # is a separate token. This results in the following:

Me: Make a simple hello world in python
Lucie: Here' s an example of how you can make a simple " Hello, World!" program in Python:

 print(" Hello, World!")

This code will output the string " Hello, World!" to the console.

Me: write “my name is ParisNeo”
Lucie: Here' s an example of how you can write the string " My Name Is Saif" in Python:

 print(" My Name Is Saif")

This code will output the string " My Name Is Saif" to the console.

Me: write # Hello
Lucie: Here is an example of how you can write the string " Hello" in Python:

 print(" Hello")

This code will output the string " Hello" to the console.

Me: no I mean # Hello
Lucie: Heres an example of how you can write the string "# Hello" in Python:

 print("# Hello")

This code will output the string "# Hello" to the console.

You can see that with the # it doesn't add the space. You can also see a space before every print.

Me: Write # Hello and add a comment before the print line
Lucie: Here' s an example that adds a comment before the print statement:

# This is a comment
 print(" Hello")

In this code, we start by adding a # symbol followed by a space to make it a single-line comment. The rest of the code will continue on its line and will be executed after the comment.

This code obviously won't work because there is an extra space before the print.

And I've tested this several times. Always the same behavior.
Has anyone else noticed this?

OpenLLM France org
edited 1 day ago

@ParisNeo
I get the same results as you.
It sometimes adds an extra space.
However, please note that the model is not meant to be a very powerful model.
Lucie is JUST a research model, and it is only intended for producing French/English sentences, not for generating code.
And currently the fine-tuning is light, allowing you to align it however you want.

Here is the website of OpenLLM-France: https://www.openllm-france.fr
Here is the Discord of OpenLLM-France: https://discord.com/invite/tZf7BR4dY7

Thanks. I agree: 7B models are not powerful enough. But they must be useful for some applications, or it would be a waste of money and energy.

I'll definitely fine-tune it. I'm excited about it being open source and ethical, but it needs to be useful and not hallucinate.

I think what went wrong here was opening a 7B model as a service. Even if it were a good model, which it is not (DeepSeek 1.5B is way superior), it shouldn't have been opened to a public that is used to GPT-4 and better. That created massive counter-publicity as people got disappointed.

I think that with fine-tuning it can become a good 7B model, useful for automating some simple tasks, since inference speed can be high.

I have a lot of comments about it. Some things can be easily fixed with fine-tuning; others are a little more difficult to fix.

For example, Lucie can't write a correct Linux path. It systematically adds spaces. That is bad, because it does have Linux instructions in its training data, so I expected it to help with Linux tasks. Ask it how to remove a file at /home/me/file.txt and it will output rm / home/ me/ file.

So basically it can't even do a hello world correctly. In an age where China is cranking out great small models, we must test our models a lot before releasing them.

I'll investigate this space thing.
This model deserves to be saved.

OpenLLM France org
edited 1 day ago

@Volko76 the spaces in the tokens are expected with Lucie.
It is a trick so that words are represented the same whether or not they start the string/sentence (this will be clarified in the technical paper to be published soonish).
So we add some spaces in the pre-processing of the tokenizer.
The problem is with the GGUF format, which does not support the space removal from the post-processing.
I guess the standard has to be extended to support Lucie's tokenizer trick.
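
You can see the trick from the transformers side with something like the sketch below (the Hub ID OpenLLM-France/Lucie-7B-Instruct is an assumption here, and the comments describe what the trick implies rather than verified outputs):

```python
# Rough sketch: inspect the prepended-space trick with the transformers tokenizer.
# Assumption: the model is published as "OpenLLM-France/Lucie-7B-Instruct".
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("OpenLLM-France/Lucie-7B-Instruct")

# Pre-processing prepends a space, so a word should map to the same
# space-prefixed token whether or not it starts the string:
print(tok.encode("print(1)", add_special_tokens=False)
      == tok.encode(" print(1)", add_special_tokens=False))  # expected: True

ids = tok.encode('print("Hello, World!")', add_special_tokens=False)
print(tok.convert_ids_to_tokens(ids))  # the first token should carry the space prefix

# decode() applies the tokenizer's post-processing (the "replace" decoder rules),
# so the prepended space is stripped again here:
print(repr(tok.decode(ids)))

# A GGUF runtime that skips those decoder rules keeps the leading space,
# which matches the behaviour reported above with ollama.
```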

OpenLLM France org
edited 1 day ago

@ParisNeo this is not at all the thread to discuss this. Do not mix issues, please.

I won't answer here. But if you open a new thread, I will answer the misunderstandings and political issues that you raise here, as one of the designers of the model.
There were a lot of misunderstandings because of the press spreading wrong information, like this being the final result of an education project.
Actually, that project just started last week. Lucie-7B (not instruct) was just a first milestone.
Alignment and instruction tuning haven't really started yet.
So please don't judge quickly unless you have spoken with people who are aware of what's going on. Thanks.

OpenLLM France org

@Volko76 the spaces in the tokens are expected with Lucie.
It is a trick so that words are represented the same whether or not they start the string/sentence.
So we add some spaces in the pre-processing of the tokenizer.
The problem is with the GGUF format, which does not support the space removal from the post-processing.
I guess the standard has to be extended to support Lucie's tokenizer trick.

Oh okay, interesting

@ParisNeo this is not at all the thread to discuss this. Do not mix issues, please.

I won't answer here. But if you open a new thread, I will answer the misunderstandings and political issues that you raise here, as one of the designers of the model.
There were a lot of misunderstandings because of the press spreading wrong information, like this being the final result of an education project.
Actually, that project just started last week. Lucie-7B (not instruct) was just a first milestone.
Alignment and instruction tuning haven't really started yet.
So please don't judge quickly unless you have spoken with people who are aware of what's going on. Thanks.

Sorry, I did not intend to talk politics. I just noticed the spaces in the GGUF.

@Jeronymous thanks for the clarification.
I'll try to play with it a bit. Maybe start from the non-instruct model and see what I can do.
Have a nice day.

@Jeronymous You are right about the spaces. I took a look at the transformers version, and indeed the tokenizer does prepend spaces and then applies a list of replace rules as post-processing.

I used the transformers library to load and test the model with a pipeline, and it does work.
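
Roughly along these lines (just a sketch: the Hub ID and a recent transformers version with chat support in the pipeline are assumptions on my side):

```python
# Rough sketch of the pipeline test (Hub ID and settings are assumptions).
from transformers import pipeline

generator = pipeline("text-generation", model="OpenLLM-France/Lucie-7B-Instruct")

messages = [
    {"role": "system", "content": "you are a helpful assistant"},
    {"role": "user", "content": 'in python print this exact string " Hello "'},
]

# The pipeline applies the chat template (the <|start_header_id|> markers shown
# in the example below) and the tokenizer's post-processing, so the stray
# leading spaces seen with the GGUF do not appear here.
out = generator(messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])
```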

Unfortunately, when I ask the AI to build something that contains a string with a space at the beginning, it automatically removes the space.

Here is an example

<|start_header_id|>system<|end_header_id|>

you are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

in python print this exact string " Hello "<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Sure, here is the code in python that will do that:

print("Hello")

This code will output the exact string "Hello" on the console.

Here the answer is correct, except that it printed "Hello" instead of " Hello ". I guess this is a side effect, since for the model these are the same.
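
That is easy to check against the tokenizer itself (again just a sketch, under the same Hub ID assumption):

```python
# Quick check (sketch): how differently do " Hello " and "Hello" tokenize?
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("OpenLLM-France/Lucie-7B-Instruct")

for s in (" Hello ", "Hello"):
    ids = tok.encode(s, add_special_tokens=False)
    print(repr(s), "->", tok.convert_ids_to_tokens(ids))

# If the prepended-space trick normalizes the leading space away, both strings
# start with the same token, which would explain why the model cannot reliably
# reproduce the requested leading space.
```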
