---
library_name: transformers
---
# WORK IN PROGRESS

We present TinyLLaVA, a small (1.4B-parameter) vision-language chatbot that achieves performance comparable to contemporary vision-language models on common benchmarks while using fewer parameters.

TinyLLaVA was trained by fine-tuning [TinyLlama](https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.3) on the [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) dataset, following the training recipe of [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). For more details, please refer to the [LLaVA-1.5 paper](https://arxiv.org/abs/2310.03744).

We have evaluated TinyLLaVA on [GQA](https://cs.stanford.edu/people/dorarad/gqa/) and other common benchmarks. More evaluations are ongoing.

## Model Preparations
### Transformers Version

Make sure you have `transformers >= 4.35.3` installed.
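
An illustrative way to verify the installed version at runtime (a quick sanity check, not a required setup step):

```python
# Illustrative version check; `packaging` ships as a dependency of transformers.
import transformers
from packaging import version

if version.parse(transformers.__version__) < version.parse("4.35.3"):
    raise RuntimeError(
        f"Found transformers {transformers.__version__}; please upgrade, "
        "e.g. `pip install -U 'transformers>=4.35.3'`."
    )
```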
### Prompt Template

The model supports multi-image and multi-prompt generation. When using the model, make sure to follow the correct prompt template (`USER: <image>xxx\nASSISTANT:`), where the `<image>` token is a special placeholder token for the image embeddings.
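
For illustration, prompts that follow this template might look like the sketch below (the questions and the earlier assistant reply are placeholders):

```python
# Single image, single question: one "<image>" placeholder marks where the
# image embeddings are inserted.
prompt = "USER: <image>\nWhat is shown in this image?\nASSISTANT:"

# Multi-image / multi-turn prompt: repeat the template with one "<image>"
# token per image, in the same order as the images you pass to the model.
multi_prompt = (
    "USER: <image>\nDescribe the first image.\nASSISTANT: <previous answer>\n"
    "USER: <image>\nHow does the second image differ?\nASSISTANT:"
)
```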

## Model Inference with `pipeline` and `transformers`
### Using `pipeline`:

Below we use the [`"bczhou/tiny-llava-v1-hf"`](https://huggingface.co/bczhou/tiny-llava-v1-hf) checkpoint.
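
The sketch below assumes the standard `image-to-text` pipeline usage; the example image URL, question, and `max_new_tokens` value are placeholders, while the final `print(outputs[0])` follows the original script:

```python
# Sketch of image-to-text generation through the high-level pipeline API.
import requests
from PIL import Image
from transformers import pipeline

model_id = "bczhou/tiny-llava-v1-hf"
pipe = pipeline("image-to-text", model=model_id)

# Placeholder example image; any RGB image works.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The prompt must follow the template described above.
prompt = "USER: <image>\nWhat are these?\nASSISTANT:"

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs[0])
```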
### Using pure `transformers`:
Below is an example script to run generation in `float16` precision on a GPU device:
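
The sketch below assumes the checkpoint loads as `LlavaForConditionalGeneration`; the example image URL, question, and generation settings are placeholders, while the final decode line follows the original script:

```python
# Sketch of float16 generation on a GPU with the lower-level transformers API.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "bczhou/tiny-llava-v1-hf"

model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)  # move the model to GPU 0
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder example image and question.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nWhat are these?\nASSISTANT:"

inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to(0, torch.float16)

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
```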
## Contact

This model was trained by [Baichuan Zhou](https://baichuanzhou.github.io/) from Beihang University, under the supervision of [Prof. Lei Huang](https://huangleibuaa.github.io/).