Word2Bezbar: Word2Vec Models for French Rap Lyrics
Overview
Word2Bezbar is a family of Word2Vec models trained on French rap lyrics sourced from Genius. The lyrics were tokenized with NLTK's French word_tokenize function, after a preprocessing step that removes French oral contractions. The training dataset was 323 MB, corresponding to 77M tokens.
The model captures the semantic relationships between words in the context of French rap, which makes it a useful tool for studies of French slang and for music lyrics analysis.
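As a rough illustration of the preprocessing described above, the sketch below expands a few oral contractions and then tokenizes with NLTK. The contraction table and the function names normalize_contractions and tokenize_lyrics are assumptions made for illustration; the exact rules used to build Word2Bezbar are not documented here.

```python
import re

# Hypothetical contraction rules, for illustration only.
# The rules actually used to train Word2Bezbar are not documented here.
ORAL_CONTRACTIONS = {
    r"\bj'suis\b": "je suis",
    r"\bt'es\b": "tu es",
    r"\bt'as\b": "tu as",
    r"\by'a\b": "il y a",
}


def normalize_contractions(text):
    """Expand a few French oral contractions into their full forms."""
    # Treat typographic apostrophes like straight ones.
    text = text.replace("\u2019", "'")
    for pattern, replacement in ORAL_CONTRACTIONS.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text


def tokenize_lyrics(text):
    """Normalize contractions, then tokenize with NLTK's French tokenizer."""
    # Imported lazily; NLTK also needs its Punkt tokenizer data,
    # e.g. nltk.download("punkt") with NLTK 3.8.1.
    from nltk.tokenize import word_tokenize
    return word_tokenize(normalize_contractions(text), language="french")
```

Only tokenize_lyrics depends on NLTK; normalize_contractions runs with the standard library alone, so the rules can be checked in isolation.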
Model Details
This is the medium-sized model.
Parameter | Value |
---|---|
Dimensionality | 200 |
Window Size | 10 |
Epochs | 20 |
Algorithm | CBOW |
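For readers who want to train a comparable model, the parameters in the table map onto Gensim 4.x keyword arguments roughly as follows. This is a sketch, not the authors' training script: the names W2V_PARAMS and train_word2vec are hypothetical, and settings the table does not list (such as the minimum word count) are left at Gensim's defaults.

```python
# Hyperparameters from the table, expressed as Gensim 4.x keyword arguments.
W2V_PARAMS = {
    "vector_size": 200,  # Dimensionality
    "window": 10,        # Window Size
    "epochs": 20,        # Epochs
    "sg": 0,             # 0 selects CBOW; 1 would select skip-gram
}


def train_word2vec(sentences):
    """Train a Word2Vec model on an iterable of token lists (illustrative)."""
    # Imported lazily so the configuration can be inspected without Gensim.
    from gensim.models import Word2Vec
    return Word2Vec(sentences=sentences, **W2V_PARAMS)
```

Here sentences is expected to be an iterable of tokenized lines, for example [["j", "suis", "dans", "le", "bendo"], ...].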
Versions
This model was trained with the following software versions:
Requirement | Version |
---|---|
Python | 3.8.5 |
Gensim library | 4.3.2 |
NLTK library | 3.8.1 |
Installation
Install Required Python Libraries:
pip install gensim
Clone the Repository:
git clone https://github.com/rapminerz/Word2Bezbar-medium.git
Navigate to the Model Directory:
cd Word2Bezbar-medium
Loading the Model
To load the Word2Bezbar Word2Vec model, use the following Python code:
import gensim
# Load the Word2Vec model
model = gensim.models.Word2Vec.load("word2vec.model")
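Queries on a word that is absent from the vocabulary raise a KeyError, so it can help to check membership before querying. The helper below is a minimal sketch; its name, most_similar_or_empty, is hypothetical, and it relies only on the key_to_index mapping and the most_similar method that Gensim 4.x exposes on model.wv.

```python
def most_similar_or_empty(wv, word, topn=10):
    """Return the nearest neighbours of `word`, or [] if it is out of vocabulary."""
    # key_to_index maps every known word to its row in the embedding matrix.
    if word not in wv.key_to_index:
        return []
    return wv.most_similar(word, topn=topn)
```

With the model loaded as above, it would be called as most_similar_or_empty(model.wv, "bendo").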
Using the Model
Once the model is loaded, you can use it as shown:
- To get the words most similar to a given word
model.wv.most_similar("bendo")
[('binks', 0.7833775877952576),
('bando', 0.7511972188949585),
('tieks', 0.7123318910598755),
('ghetto', 0.6887569427490234),
('hall', 0.679759681224823),
('barrio', 0.6694452166557312),
('hood', 0.6490002274513245),
('block', 0.6299082040786743),
('bloc', 0.627208411693573),
('secteur', 0.6225507855415344)]
model.wv.most_similar("kichta")
[('liasse', 0.7877408266067505),
('sse-lia', 0.7605615854263306),
('kishta', 0.7043415904045105),
('kich', 0.663270890712738),
('sacoche', 0.6381840705871582),
('moula', 0.6318666338920593),
('valise', 0.5628494024276733),
('bonbonne', 0.55326247215271),
('skalape', 0.5523083806037903),
('kichtas', 0.5385912656784058)]
- To find the word that doesn't match in a list of words
model.wv.doesnt_match(["racli","gow","gadji","fimbi","boug"])
'boug'
model.wv.doesnt_match(["Zidane","Mbappé","Ronaldo","Messi","Jordan"])
'Jordan'
- To find the similarity between two words
model.wv.similarity("kichta", "moula")
0.63186663
model.wv.similarity("bonheur", "moula")
0.14551902
- Or even get the vector representation of a word
model.wv['ekip']
array([ 1.4757039e-01, ... 1.1260221e+00],
dtype=float32)
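The scores returned by similarity and most_similar are cosine similarities between word vectors. The pure-Python sketch below shows that computation on plain sequences of numbers; the function name cosine_similarity is an illustrative choice, not part of Gensim.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Applied to two vectors from the model, for example cosine_similarity(model.wv["kichta"], model.wv["moula"]), it should agree with model.wv.similarity("kichta", "moula") up to floating-point precision.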
Purpose and Disclaimer
This model is designed for academic and research purposes only. It is not intended for commercial use. The creators of this model do not endorse or promote any specific views or opinions that may be represented in the dataset.
Please mention @RapMinerz if you use our models.
Contact
For any questions or issues, please contact the repository owner, RapMinerz, at [email protected].