Update Readme
README.md CHANGED

- [Showcases](#showcases)
- [How to use Yi-VL?](#how-to-use-yi-vl)
- [Quick start](#quick-start)
- [Hardware requirement](#hardware-requirement)
- [Misc.](#misc)
- [Citation](#citation)
- [Acknowledgements and attributions](#acknowledgements-and-attributions)

- Yi-VL demonstrates exceptional performance, **ranking first** among all existing open-source models in the latest benchmarks including [MMMU](https://mmmu-benchmark.github.io/#leaderboard) in English and [CMMMU](https://mmmu-benchmark.github.io/#leaderboard) in Chinese (based on data available up to January 2024).

- Yi-VL-34B is the **first** open-source 34B vision language model worldwide.

## Models

Yi-VL offers the following features:

- Strong image comprehension: Yi-VL is adept at analyzing visuals, making it an efficient tool for tasks like extracting, organizing, and summarizing information from images.

- Fine-grained image resolution: Yi-VL supports image understanding at a higher resolution of 448×448.

## Architecture

Yi-VL adopts the [LLaVA](https://github.com/haotian-liu/LLaVA) architecture. Its main components include:

- Projection Module: designed to align image features with the text feature space, it consists of a two-layer Multilayer Perceptron (MLP) with layer normalizations.

- Large Language Model (LLM): initialized with [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat) or [Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat), it demonstrates exceptional proficiency in understanding and generating both English and Chinese.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/EGVHSWG4kAcX01xDaoeXS.png)

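To make the projection module concrete, here is a minimal PyTorch sketch of a two-layer MLP with layer normalizations that maps ViT image features into the LLM embedding space. The class name, feature dimensions, activation, and exact placement of the normalization layers are illustrative assumptions, not Yi-VL's actual implementation.

```python
import torch
import torch.nn as nn

class ProjectionMLP(nn.Module):
    """Illustrative two-layer MLP with layer normalizations that projects
    ViT image features into the LLM's text embedding space (dimensions are
    assumptions, not the exact Yi-VL values)."""

    def __init__(self, vit_dim: int = 1280, llm_dim: int = 7168):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.LayerNorm(llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
            nn.LayerNorm(llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vit_dim) from the vision tower
        # returns:        (batch, num_patches, llm_dim), aligned with text embeddings
        return self.proj(image_features)

# Example: one image, 256 visual tokens of width 1280 -> LLM hidden size 7168
visual_tokens = ProjectionMLP()(torch.randn(1, 256, 1280))
print(visual_tokens.shape)  # torch.Size([1, 256, 7168])
```
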
## Training

Yi-VL is trained to align visual information with the semantic space of the Yi LLM through a comprehensive three-stage process:

- Stage 1: The parameters of the ViT and the projection module are trained at an image resolution of 224×224, while the LLM weights are frozen. Training leverages an image caption dataset of 100 million image-text pairs from [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/). The primary objective is to enhance the ViT's knowledge acquisition within our specified architecture and to achieve better alignment between the ViT and the LLM.

- Stage 2: The image resolution of the ViT is scaled up to 448×448, and the parameters of the ViT and the projection module are trained. This stage aims to further boost the model's ability to discern intricate visual details. The dataset used in this stage includes about 25 million image-text pairs, such as [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/), [CLLaVA](https://huggingface.co/datasets/LinkSoul/Chinese-LLaVA-Vision-Instructions), [LLaVAR](https://llavar.github.io/), [Flickr](https://www.kaggle.com/datasets/hsankesara/flickr-image-dataset), [VQAv2](https://paperswithcode.com/dataset/visual-question-answering-v2-0), [RefCOCO](https://github.com/lichengunc/refer/tree/master), and [Visual7w](http://ai.stanford.edu/~yukez/visual7w/), among others.

- Stage 3: The parameters of the entire model (that is, the ViT, the projection module, and the LLM) are trained. The primary goal is to enhance the model's proficiency in multimodal chat interactions, endowing it with the ability to seamlessly integrate and interpret visual and linguistic inputs. The training dataset encompasses a diverse range of sources totalling approximately 1 million image-text pairs, including [GQA](https://cs.stanford.edu/people/dorarad/gqa/download.html), [VizWiz VQA](https://vizwiz.org/tasks-and-datasets/vqa/), [TextCaps](https://opendatalab.com/OpenDataLab/TextCaps), [OCR-VQA](https://ocr-vqa.github.io/), [Visual Genome](https://homes.cs.washington.edu/~ranjay/visualgenome/api.html), and [LAION GPT4V](https://huggingface.co/datasets/laion/gpt4v-dataset). To ensure data balancing, we cap the contribution from any single source at 50,000 pairs.

Below are the parameters configured for each stage.

| Stage | Global batch size | Learning rate | Gradient clip | Epochs |
|---|---|---|---|---|
| Stage 1, 2 | 4096 | 1e-4 | 0.5 | 1 |
| Stage 3 | 256 | 2e-5 | 1.0 | 2 |

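For convenience, the sketch below collects the stage recipe from the text and the table above (image resolution, trainable modules, batch size, learning rate, gradient clip, epochs) into one Python structure. It is a hypothetical layout for illustration; the submodule names (`vit`, `projection`, `llm`) and the freezing helper are assumptions, not Yi-VL's actual training code.

```python
from dataclasses import dataclass

@dataclass
class StageConfig:
    """Hypothetical per-stage settings, transcribed from the description and table above."""
    image_resolution: int                # ViT input resolution for the stage
    trainable_modules: tuple[str, ...]   # modules whose parameters are updated
    global_batch_size: int
    learning_rate: float
    gradient_clip: float
    epochs: int

STAGES = {
    "stage1": StageConfig(224, ("vit", "projection"), 4096, 1e-4, 0.5, 1),
    "stage2": StageConfig(448, ("vit", "projection"), 4096, 1e-4, 0.5, 1),
    "stage3": StageConfig(448, ("vit", "projection", "llm"), 256, 2e-5, 1.0, 2),
}

def set_trainable(model, stage: StageConfig) -> None:
    """Freeze every parameter, then unfreeze the modules trained in this stage.
    Assumes the model exposes .vit, .projection and .llm submodules (an assumption)."""
    for p in model.parameters():
        p.requires_grad = False
    for name in stage.trainable_modules:
        for p in getattr(model, name).parameters():
            p.requires_grad = True
```
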
### Training resource consumption

- The training consumes 128 NVIDIA A800 (80 GB) GPUs.

- The total training time amounted to approximately 10 days for Yi-VL-34B and 3 days for Yi-VL-6B.

Yi-VL outperforms all existing open-source models in [MMMU](https://mmmu-benchmark.github.io/#leaderboard) and [CMMMU](https://mmmu-benchmark.github.io/#leaderboard) (based on data available up to January 2024).

- MMMU

![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/kCmXuwLbLvequ93kjh3mg.png)

- CMMMU

![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/6YuSakMCg3D2AozixdoZ0.png)

## Showcases

Below are some representative examples of detailed description and visual question answering, showcasing the capabilities of Yi-VL.

- English

Notes:

- You need to set the parameter `mm_vision_tower` in `config.json` to the local ViT path (see the sketch below).

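As one hypothetical way to do this, the snippet below rewrites `mm_vision_tower` in the model's `config.json` so that it points at a local ViT directory. Both paths are placeholders, and this is only a sketch of the idea, not an official setup script.

```python
import json

config_path = "Yi-VL-34B/config.json"          # path to the downloaded model's config.json (placeholder)
local_vit_path = "/models/vit/clip-vit-local"  # local ViT directory (placeholder)

with open(config_path) as f:
    config = json.load(f)

config["mm_vision_tower"] = local_vit_path     # the parameter mentioned in the note above

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```
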
## Hardware requirement

For model inference, the recommended GPU examples are:

- Yi-VL-6B: RTX 3090, RTX 4090, A10, A30

- Yi-VL-34B: 4 × RTX 4090, A800 (80 GB)

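As a rough, back-of-the-envelope check (an estimate, not an official figure): with 16-bit weights, the 34B language model alone occupies about 34B × 2 bytes ≈ 68 GB, plus memory for the ViT, the projection module, and inference activations, which is consistent with a single 80 GB A800 or the model sharded across four 24 GB RTX 4090s. Yi-VL-6B at roughly 6B × 2 bytes ≈ 12 GB fits on a single 24 GB consumer GPU.
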
# Misc.

## Citation