---
license: apache-2.0
pipeline_tag: visual-question-answering
---
# Building Your Own Multimodal Large Model from Scratch
For the Chinese version of the README, please refer to [δΈζζζ‘£](README_zh.md).
## Model Architecture π€
In this VLM (Visual Language Model), the visual component uses the `CLIP` or `SIGLIP` model, whose image features are already roughly aligned with text semantics. A two-layer MLP maps these visual features into the language model's embedding space, and by overriding the `forward` method of `QWenModel`, the placeholder `image` tokens in the input are replaced with the mapped visual features.
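The sketch below illustrates this idea under stated assumptions; it is not the repository's actual code. The class and function names (`VisualProjector`, `merge_image_features`), the placeholder token id, and the hidden sizes are all hypothetical.

```python
# Minimal sketch of the two-layer MLP projector and image-token replacement.
# Names, dimensions, and the placeholder token id are illustrative only.
import torch
import torch.nn as nn

IMAGE_TOKEN_ID = 151857  # hypothetical id reserved for image-patch placeholders


class VisualProjector(nn.Module):
    """Two-layer MLP mapping vision-encoder features into the LLM embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from CLIP/SIGLIP
        return self.mlp(image_features)


def merge_image_features(input_ids, text_embeds, image_embeds):
    """Replace embeddings at image-token positions with projected visual features.

    input_ids:    (batch, seq_len) token ids containing IMAGE_TOKEN_ID placeholders
    text_embeds:  (batch, seq_len, llm_dim) output of the LLM's embedding layer
    image_embeds: (batch, num_patches, llm_dim) output of VisualProjector
    """
    merged = text_embeds.clone()
    for b in range(input_ids.size(0)):
        positions = (input_ids[b] == IMAGE_TOKEN_ID).nonzero(as_tuple=True)[0]
        # assumes the prompt reserves exactly num_patches placeholder tokens
        merged[b, positions] = image_embeds[b, : positions.numel()]
    return merged
```

Inside the overridden `forward`, the merged embeddings would then be passed to the transformer layers in place of the plain text embeddings, so the language model attends to visual and textual tokens jointly.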
## GitHub Repository π
The code for running the model can be found at [Basic-Visual-Language-Model](https://github.com/xinyanghuang7/Basic-Visual-Language-Model/tree/main).
## References π
Special thanks to the following projects for their great work π:
- https://github.com/WatchTower-Liu/VLM-learning/tree/main
- https://github.com/QwenLM/Qwen
- https://github.com/haotian-liu/LLaVA
## Contact β
If you have any questions or ideas, feel free to reach out to me π:
[email protected]
I will respond as soon as I see your email!