TeLVE: Turkish efficient Language Vision Engine 🧿
First Turkish VLM ever!
TeLVE is the first Visual Language Model specifically designed for Turkish language understanding and image description generation. Built on Vision Transformer (ViT) and BERT pre-trained encoder architectures, it bridges the gap in Turkish visual-linguistic processing.
Model Description
TeLVE combines:
- 🖼️ Vision Transformer (ViT-base-patch16-224)
- 📝 Turkish BERT (dbmdz/bert-base-turkish-cased)
- 🔄 Cross-attention mechanism for vision-language fusion
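The vision-language fusion above can be sketched with a small cross-attention module. This is an illustrative assumption of how such a fusion block typically looks (text tokens attending over image patches), not TeLVE's exact implementation; the class name, dimensions, and residual/norm layout are hypothetical.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative cross-attention block: text tokens (BERT side)
    attend over image patch embeddings (ViT side)."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        # dim=768 matches the hidden size of both ViT-base and BERT-base
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_states, image_states):
        # Queries come from the text; keys/values come from the image patches
        fused, _ = self.attn(text_states, image_states, image_states)
        return self.norm(text_states + fused)  # residual connection + norm

# ViT-base/16 on a 224x224 image yields 196 patches + 1 CLS = 197 tokens
text = torch.randn(2, 32, 768)    # (batch, text_length, hidden)
image = torch.randn(2, 197, 768)  # (batch, image_tokens, hidden)
out = CrossAttentionFusion()(text, image)
print(out.shape)  # torch.Size([2, 32, 768])
```

The fused states keep the text sequence length, so they can feed directly into a token-prediction head for caption generation.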
Version Logs
- TeLVE v1.0: Trained on Unsplash Lite dataset
- TeLVE v1.0dep: Dataset extended with selected images from Pexels; fixed the encoding problem with the letter "ü". (Deprecated: performance decreased due to a dataset addressing problem. Not recommended for use.)
Usage
The model can be used in two ways:
Inference (imagine.py)
# Generate captions for images
python imagine.py
This script:
- Loads a trained TeLVE model
- Takes images from the `images` directory
- Generates Turkish captions for each image
- Outputs the results to the console
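The caption-generation step above is typically a greedy decoding loop that picks the most likely next token until an end token appears. The sketch below is an illustrative assumption, not the actual `imagine.py` code: `greedy_caption` and the toy scoring function are hypothetical stand-ins for the model.

```python
import torch

@torch.no_grad()
def greedy_caption(step_fn, bos_id, eos_id, max_len=32):
    """Greedy decoding: repeatedly pick the most likely next token.
    step_fn maps the token-id sequence so far to next-token logits."""
    ids = [bos_id]
    for _ in range(max_len):
        next_id = int(step_fn(ids).argmax())
        if next_id == eos_id:          # stop at the end-of-sequence token
            break
        ids.append(next_id)
    return ids[1:]                     # drop the BOS token

# Toy stand-in for the model: favours token (last_id + 1), emits EOS after 3 steps
def toy_step(ids):
    logits = torch.zeros(10)
    logits[ids[-1] + 1 if len(ids) < 4 else 9] = 1.0
    return logits

print(greedy_caption(toy_step, bos_id=0, eos_id=9))  # [1, 2, 3]
```

In the real script the decoded ids would be passed through the Turkish BERT tokenizer's `decode` to produce the caption string.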
Training (main.py)
Users can train their own models with ViT and BERT encoders.
# Train a new model
python main.py
This script:
- Loads and preprocesses image-caption pairs
- Initializes ViT and BERT encoders
- Trains the combined model
- Saves the model and tokenizer
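A single training step of the kind listed above can be sketched on dummy tensors: fused vision-language states go through a token-prediction head, and cross-entropy against the caption token ids drives the update. The shapes, vocabulary size, and optimizer settings below are assumptions for illustration, not the values used in `main.py`.

```python
import torch
import torch.nn as nn

vocab_size, hidden = 32000, 768                # illustrative sizes
head = nn.Linear(hidden, vocab_size)           # token-prediction head
optimizer = torch.optim.AdamW(head.parameters(), lr=5e-5)

fused = torch.randn(2, 32, hidden)             # fused vision-language states
labels = torch.randint(0, vocab_size, (2, 32)) # target caption token ids

logits = head(fused)                           # (batch, seq_len, vocab)
# Flatten batch and sequence dims for token-level cross-entropy
loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), labels.reshape(-1))
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"loss: {float(loss):.3f}")
```

In the full model the `fused` tensor would come from the ViT/BERT encoders rather than random noise, and the loop would iterate over the image-caption dataset.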
Performance
Performance scores have not been evaluated yet; benchmark results will be published once available.
Citation
@software{telve2024,
  author = {Öğüt Su Karagün},
  title  = {TeLVE: Turkish efficient Language Vision Engine},
  year   = {2024},
  url    = {https://huggingface.co/outsu/TeLVE}
}
License
TeLVE © 2024 by Öğüt Su Karagün is licensed under Creative Commons Attribution 4.0 International