GPT2-Medium-BG 2021

  • GPT2-Medium 345M for Bulgarian
  • The model was created and trained from scratch using TensorFlow on a free Google Colab T4. The research experiment started in June 2021; the video explanation was uploaded on 17.9.2024.
  • It is supposed to be run with the code provided here and in the notebook; read the comments in gen_comments-1-2023-clean.py.
  • As far as I knew, this was the biggest GPT/Transformer model in Bulgarian at the time, except for one of unknown size which was demoed for a few seconds in a video on LinkedIn* (more in the footnote below).
  • A method for unlimited-length multi-step generation with hidden injections of tokens for directed topic change (though it needed more smoothing etc.; a minimal sketch of the idea is shown after this list). The methods are explained in videos on YouTube.
  • The dataset was quite small, with a maximum of about 140 MB(?) of UTF-8 text, including some words and texts in other languages (thus more than 70M characters), but IMO the results were decent for the size (a subjective impression, no systematic formal test).
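
The exact generation code is in gen_comments-1-2023-clean.py and the notebook; the following is only a minimal sketch of the multi-step idea with hidden injections, under assumptions not taken from the original code: a hypothetical `generate_tokens(model, ids, n_new)` sampler and a `tokenizer` with `encode`/`decode`.

```python
# Minimal sketch of multi-step generation with hidden "steering" injections.
# NOT the original implementation: generate_tokens(model, ids, n_new) (returns
# `ids` extended with n_new sampled tokens) and `tokenizer` are assumed helpers.

def generate_long(model, tokenizer, prompt, steering_texts,
                  step_tokens=100, context_limit=160):
    visible = prompt                       # text shown to the reader
    ids = tokenizer.encode(prompt)
    for steer in steering_texts:
        # Hidden injection: the steering tokens go into the context to nudge
        # the topic, but are never appended to the visible output.
        ids = (ids + tokenizer.encode(steer))[-context_limit:]
        before = len(ids)
        ids = generate_tokens(model, ids, step_tokens)   # one generation step
        visible += tokenizer.decode(ids[before:])        # only the new tokens
        ids = ids[-context_limit:]                       # keep the window bounded
    return visible
```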

Dataset, preprocessing and training

  • Various selected books from "Chitanka", with some cleaning of the markers for notes and footnotes ([34]... etc.), ids, links of the books etc. (a sketch of such cleaning is shown after this list).
  • Various works, books, publications and texts written by the author himself; the biggest file among them was the then-current draft of a large survey book on AGI & Transhumanism, along with several other interdisciplinary books. Many articles and works from the e-zine "The Sacred Computer" (Свещеният сметач).
  • Some poetry by Hristo Botev
  • A few articles about computers from forums and web pages, some written in Bulgarian, some machine-translated from English to Bulgarian.
  • Some articles from magazines on political topics
  • Some of the titles of the files can be seen in the training video
  • During training the dataset and its sampling were incrementally updated or changed after observing the generations and recognizing the sources of the style of the outputs. Some books seemed "poisonous" with their patterns and were reduced or removed, e.g. I. Asimov's extensive repetition of the characters' names.
  • Some items were added, others removed; some smaller documents were consumed multiple times, while from items which were too big a shorter random section was selected, a different section in each iteration, etc.
  • Due to the use of a free Colab notebook with a limited and unpredictable number of uninterruptible hours (maybe up to 3 hours or so, sometimes less, occasionally a few hours more), it was impossible to perform a complete pass over the entire dataset in one session (a dataset that is too big may also fail to fit in RAM at once).
  • For that reason the individual training iterations sliced and shuffled the text, e.g. picking, say, 200K characters from each long document from the beginning, then from the end, or the first half, then the second half, or randomly, etc. Smaller documents were usually "ingested" completely (see the sampling sketch after this list).
  • For the training: see the video instructions, as the notebook has cells which are not cleaned up and should not always be invoked. There is also an updated version, created due to a discovered incompatibility of the initial one.
  • Some data "augmentation": changes of names, besides the removal of repetitive patterns (also illustrated in the sampling sketch after this list).
  • As the dataset was dynamically changed and unknown special characters appeared here and there, there were issues with the tokens: some were missing from the vocabulary, which resulted in errors during the preparation of the dataset by TensorFlow. This was worked around with a hack that simply ignored these fragments, as I didn't want to start from scratch (see the pipeline sketch after this list).
  • In the instruction video a few hyperparameters can be seen, which were used in some late parts of the process: BLOCK_SIZE = 160; BUFFER_SIZE = 3200 (used in the pipeline sketch below).
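
A possible shape of the footnote-marker cleaning mentioned above (an illustration only; the actual cleaning scripts and the exact patterns are not published here):

```python
import re

def clean_chitanka_text(text):
    # Strip "[34]"-style note/footnote markers and stray links;
    # the patterns used originally may have differed.
    text = re.sub(r"\[\d+\]", "", text)        # footnote markers like [34]
    text = re.sub(r"https?://\S+", "", text)   # stray links
    return text
```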
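
A minimal sketch of the per-session sampling described above: long documents contribute a random ~200K-character slice, short ones are taken whole, and a hypothetical name map illustrates the name-change augmentation. This is not the original notebook code.

```python
import random

SLICE_CHARS = 200_000     # approximate per-document slice mentioned above

def sample_shard(documents, name_map=None):
    """Build one training shard for a single Colab session."""
    shard = []
    for text in documents:
        if len(text) > SLICE_CHARS:
            # a different random window from the long document in each session
            start = random.randrange(len(text) - SLICE_CHARS)
            piece = text[start:start + SLICE_CHARS]
        else:
            piece = text                  # smaller documents are ingested completely
        if name_map:                      # hypothetical augmentation: rename characters
            for old, new in name_map.items():
                piece = piece.replace(old, new)
        shard.append(piece)
    random.shuffle(shard)
    return "\n".join(shard)
```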
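
The missing-token workaround and the hyperparameters above could fit into a TensorFlow input pipeline roughly like this. It is a hedged reconstruction, not the original cells: the tokenization is shown character-level with an assumed `token2id` vocabulary, and BATCH_SIZE is an assumed value.

```python
import tensorflow as tf

BLOCK_SIZE = 160      # sequence length, as in the instruction video
BUFFER_SIZE = 3200    # shuffle buffer size, as in the instruction video
BATCH_SIZE = 16       # assumed value, not stated in the description

def encode_filtered(text, token2id):
    # The workaround: silently drop fragments (here, single characters)
    # that are missing from the vocabulary instead of rebuilding it.
    return [token2id[ch] for ch in text if ch in token2id]

def make_dataset(text, token2id):
    ids = tf.constant(encode_filtered(text, token2id), dtype=tf.int32)
    ds = tf.data.Dataset.from_tensor_slices(ids)
    # Pack the token stream into blocks of BLOCK_SIZE + 1, then split
    # each block into (input, shifted-by-one target) pairs.
    ds = ds.batch(BLOCK_SIZE + 1, drop_remainder=True)
    ds = ds.map(lambda chunk: (chunk[:-1], chunk[1:]))
    return (ds.shuffle(BUFFER_SIZE)
              .batch(BATCH_SIZE, drop_remainder=True)
              .prefetch(tf.data.AUTOTUNE))
```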

Links

The Sacred Computer: Thinking Machines, Creativity and Human Development

...

  • Other Bulgarian autoregressive models: an earlier one was a few-seconds display of a generation in Bulgarian by a startup called BAIHUI AI in mid-2019. I wrote "1.5B" in my blog, but I don't remember whether they actually mentioned a size, and now it seems unlikely and unreasonable; they just showed that they could train a model, a team of 3 people, only one of them an ML engineer. A few records survive: my blog post: https://artificial-mind.blogspot.com/2019/07/baihuiai-baihuiai-new-bulgarian-ai.html and info here: https://www.eu-startups.com/directory/baihui-ai/ The company didn't live long; it was a show-off. Now it seems reasonable that their model was GPT2-SMALL, as that was the usual choice even 4 years later, and even the Bulgarian Academy of Sciences' 2023 model was the small one. I found several other GPT2-SMALL models trained later than this one, one for poetry, the BAS one from 2023 and maybe a few others. I couldn't get info from the ML engineer of the BAIHUI project, M.V.