---
language:
- fa
metrics:
- wer
pipeline_tag: image-to-text
---

A Persian image captioning model built on a ViT + RoBERTa encoder-decoder architecture and trained on flickr30k-fa.

The encoder (ViT) was initialized from https://huggingface.co/google/vit-base-patch16-224 and the decoder (RoBERTa) from https://huggingface.co/HooshvareLab/roberta-fa-zwnj-base.

## Usage

```
pip install hezar
```

```python
from hezar import Model

# Load the pretrained captioning model from the Hugging Face Hub
model = Model.load("hezarai/vit-roberta-fa-image-captioning-flickr30k")
# Generate a Persian caption for a local image file
captions = model.predict("example_image.jpg")
print(captions)
```
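
The metadata of this card lists `wer` (word error rate) as its metric. The card does not state how that score was computed, so the following is only a minimal sketch of word-level WER under simple assumptions: captions are split on whitespace, and the helper name `word_error_rate` is hypothetical and not part of Hezar.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: tokens are obtained by plain whitespace splitting.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder strings; in practice these would be a reference caption
# and a caption produced by the model.
print(word_error_rate("a dog runs in the park", "a dog runs in a park"))
```

Because the edit distance is divided by the reference length, the score can exceed 1.0 when the predicted caption is much longer than the reference.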