Tags: Image-Text-to-Text, Transformers, Safetensors, English, Chinese, llava, vision-language, llm, lmm, conversational, Inference Endpoints
bczhou committed
Commit 543bb42
1 Parent(s): 41071f3

Update README.md

Files changed (1)
  1. README.md +8 -6
README.md CHANGED
@@ -11,9 +11,9 @@ library_name: transformers
 
 # WORK IN PROGRESS
 
-## Model type
-TinyLLaVA, a tiny model (1.4B) trained using the exact training recipe of [LLaVA-1.5](https://github.com/haotian-liu/LLaVA).
-We trained our TinyLLaVA using [TinyLlama](https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.3) as our LLM backbone, and [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336) as our vision backbone.
+We present TinyLLaVA, a small vision-language chatbot (1.4B) that achieves performance comparable to contemporary vision-language models on common benchmarks, while using fewer parameters.
+TinyLLaVA was trained by finetuning [TinyLlama](https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.3) on the [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) dataset, following the LLaVA-1.5 training recipe. For more details, please refer to the [LLaVA-1.5 paper](https://arxiv.org/abs/2310.03744).
+
 
 ## Model Performance
 We have evaluated TinyLLaVA on [GQA](https://cs.stanford.edu/people/dorarad/gqa/about.html), [VizWiz](https://www.vizwiz.com/), [VQAv2](https://visualqa.org/), [TextVQA](https://textvqa.org/) and [SQA](https://github.com/lupantech/ScienceQA).
@@ -31,8 +31,6 @@ We have evaluated TinyLLaVA on [GQA](https://cs.stanford.edu/people/dorarad/gqa/
 
 More evaluations are ongoing.
 
-## Model use
-The weights have been converted to hf format.
 
 ## How to use the model
 
@@ -79,4 +77,8 @@ raw_image = Image.open(requests.get(image_file, stream=True).raw)
 inputs = processor(prompt, raw_image, return_tensors='pt').to(0, torch.float16)
 output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
 print(processor.decode(output[0][2:], skip_special_tokens=True))
-```
+```
+
+## Contact
+
+This model was trained by [Baichuan Zhou](https://baichuanzhou.github.io/), from Beihang University, under the supervision of [Prof. Lei Huang](https://huangleibuaa.github.io/).
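
For context, the last hunk only shows the tail of the README's "How to use the model" example. Below is a minimal, self-contained sketch of what the full example could look like, reconstructed around the three visible lines; the model id `bczhou/tiny-llava-v1-hf`, the prompt template, and the sample image URL are illustrative assumptions and are not part of this commit.

```python
# Hedged sketch: reconstructs a full usage example around the lines visible in the diff.
# The model id, prompt template, and image URL below are assumptions for illustration.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "bczhou/tiny-llava-v1-hf"  # assumed repo id

# Load the model in float16 on GPU 0, matching the .to(0, torch.float16) call in the diff.
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16).to(0)
processor = AutoProcessor.from_pretrained(model_id)

# LLaVA-1.5-style prompt with an <image> placeholder (assumed template).
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
raw_image = Image.open(requests.get(image_file, stream=True).raw)

# The three lines below mirror the snippet tail shown in this commit's diff.
inputs = processor(prompt, raw_image, return_tensors='pt').to(0, torch.float16)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
```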