czczup committed on
Commit
6a28ca8
1 Parent(s): cb4dd67

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +253 -13
  2. modeling_intern_vit.py +6 -13
README.md CHANGED
@@ -62,6 +62,8 @@ InternVL 2.0 is a multimodal large language model series, featuring models of va
62
  | MathVista<sub>testmini</sub> | 28.7 | 41.1 | 46.3 | 37.7 |
63
  | OpenCompass<sub>avg</sub> | 46.6 | 49.8 | 54.0 | 48.3 |
64
 
 
 
65
  - We simultaneously use InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were tested using the InternVL repository. OCRBench, RealWorldQA, HallBench, and MathVista were evaluated using the VLMEvalKit.
66
 
67
  - For MMMU, we report both the original scores (left side: evaluated using the InternVL codebase for InternVL series models, and sourced from technical reports or webpages for other models) and the VLMEvalKit scores (right side: collected from the OpenCompass leaderboard).
@@ -300,7 +302,7 @@ tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast
300
 
301
  # set the max number of tiles in `max_num`
302
  pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
303
- generation_config = dict(max_new_tokens=1024, do_sample=False)
304
 
305
  # pure-text conversation (纯文本对话)
306
  question = 'Hello, who are you?'
@@ -452,21 +454,140 @@ for new_text in streamer:
452
 
453
  ## Finetune
454
 
455
- SWIFT from ModelScope community has supported the fine-tuning (Image/Video) of InternVL, please check [this link](https://github.com/modelscope/swift/blob/main/docs/source_en/Multi-Modal/internvl-best-practice.md) for more details.
456
 
457
  ## Deployment
458
 
459
  ### LMDeploy
460
 
461
- > Warning: This model is not yet supported by LMDeploy.
462
 
463
- ### vLLM
464
 
465
- TODO
466
 
467
- ### Ollama
468
 
469
- TODO
470
 
471
  ## License
472
 
@@ -540,6 +661,8 @@ InternVL 2.0 是一个多模态大语言模型系列,包含各种规模的模
540
  | MathVista<sub>testmini</sub> | 28.7 | 41.1 | 46.3 | 37.7 |
541
  | OpenCompass<sub>avg</sub> | 46.6 | 49.8 | 54.0 | 48.3 |
542
 
 
 
543
  - 我们同时使用 InternVL 和 VLMEvalKit 仓库进行模型评估。具体来说,DocVQA、ChartQA、InfoVQA、TextVQA、MME、AI2D、MMBench、CCBench、MMVet 和 SEED-Image 的结果是使用 InternVL 仓库测试的。OCRBench、RealWorldQA、HallBench 和 MathVista 是使用 VLMEvalKit 进行评估的。
544
 
545
  - 对于MMMU,我们报告了原始分数(左侧:InternVL系列模型使用InternVL代码库评测,其他模型的分数来自其技术报告或网页)和VLMEvalKit分数(右侧:从OpenCompass排行榜收集)。
@@ -598,21 +721,138 @@ InternVL 2.0 是一个多模态大语言模型系列,包含各种规模的模
598
 
599
  ## 微调
600
 
601
- 来自ModelScope社区的SWIFT已经支持对InternVL进行微调(图像/视频),详情请查看[此链接](https://github.com/modelscope/swift/blob/main/docs/source_en/Multi-Modal/internvl-best-practice.md)
602
 
603
  ## 部署
604
 
605
  ### LMDeploy
606
 
607
- > 注意:此模型尚未被 LMDeploy 支持。
608
 
609
- ### vLLM
610
 
611
- TODO
612
 
613
- ### Ollama
614
 
615
- TODO
616
 
617
  ## 开源许可证
618
 
 
62
  | MathVista<sub>testmini</sub> | 28.7 | 41.1 | 46.3 | 37.7 |
63
  | OpenCompass<sub>avg</sub> | 46.6 | 49.8 | 54.0 | 48.3 |
64
 
65
+ - For more details and evaluation reproduction, please refer to our [Evaluation Guide](https://internvl.readthedocs.io/en/latest/internvl2.0/evaluation.html).
66
+
67
  - We simultaneously use InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were tested using the InternVL repository. OCRBench, RealWorldQA, HallBench, and MathVista were evaluated using the VLMEvalKit.
68
 
69
  - For MMMU, we report both the original scores (left side: evaluated using the InternVL codebase for InternVL series models, and sourced from technical reports or webpages for other models) and the VLMEvalKit scores (right side: collected from the OpenCompass leaderboard).
 
302
 
303
  # set the max number of tiles in `max_num`
304
  pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
305
+ generation_config = dict(max_new_tokens=1024, do_sample=True)
306
 
307
  # pure-text conversation (纯文本对话)
308
  question = 'Hello, who are you?'
 
454
 
455
  ## Finetune
456
 
457
+ Many repositories now support fine-tuning of the InternVL series models, including [InternVL](https://github.com/OpenGVLab/InternVL), [SWIFT](https://github.com/modelscope/ms-swift), [XTuner](https://github.com/InternLM/xtuner), and others. Please refer to their documentation for more details on fine-tuning.
458
 
459
  ## Deployment
460
 
461
  ### LMDeploy
462
 
463
+ LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.
464
+
465
+ ```sh
466
+ pip install lmdeploy==0.5.3
467
+ ```
468
+
469
+ LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLMs) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.
470
+
471
+ #### A 'Hello, world' example
472
+
473
+ ```python
474
+ from lmdeploy import pipeline, TurbomindEngineConfig
475
+ from lmdeploy.vl import load_image
476
+
477
+ model = 'OpenGVLab/InternVL2-1B'
478
+ image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
479
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
480
+ response = pipe(('describe this image', image))
481
+ print(response.text)
482
+ ```
483
+
484
+ If an `ImportError` occurs while running this example, please install the required dependency packages as prompted.
485
+
486
+ #### Multi-images inference
487
+
488
+ When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased.
489
+
490
+ > Warning: Due to the scarcity of multi-image conversation data, the performance on multi-image tasks may be unstable, and it may require multiple attempts to achieve satisfactory results.
491
+
492
+ ```python
493
+ from lmdeploy import pipeline, TurbomindEngineConfig
494
+ from lmdeploy.vl import load_image
495
+ from lmdeploy.vl.constants import IMAGE_TOKEN
496
+
497
+ model = 'OpenGVLab/InternVL2-1B'
498
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
499
+
500
+ image_urls=[
501
+ 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
502
+ 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
503
+ ]
504
+
505
+ images = [load_image(img_url) for img_url in image_urls]
506
+ # Numbering images improves multi-image conversations
507
+ response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
508
+ print(response.text)
509
+ ```
510
+
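If a longer multi-image prompt overflows the context, you can enlarge the window by raising `session_len` in the engine config; a minimal sketch (16384 is only an illustrative value, not a recommendation from this model card):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# A larger session_len gives the engine more room for the extra image tokens.
pipe = pipeline('OpenGVLab/InternVL2-1B',
                backend_config=TurbomindEngineConfig(session_len=16384))
```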
511
+ #### Batch prompts inference
512
 
513
+ Conducting inference with batch prompts is quite straightforward; just place them within a list structure:
514
 
515
+ ```python
516
+ from lmdeploy import pipeline, TurbomindEngineConfig
517
+ from lmdeploy.vl import load_image
518
+
519
+ model = 'OpenGVLab/InternVL2-1B'
520
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
521
+
522
+ image_urls=[
523
+ "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
524
+ "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
525
+ ]
526
+ prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
527
+ response = pipe(prompts)
528
+ print(response)
529
+ ```
530
+
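`pipe(prompts)` returns one response per prompt, in the same order as the input list; a small usage sketch, assuming the batch code above has just been run:

```python
# Pair each (text, image) prompt with its generated answer.
for (prompt_text, _), resp in zip(prompts, response):
    print(prompt_text, '->', resp.text)
```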
531
+ #### Multi-turn conversation
532
+
533
+ There are two ways to do multi-turn conversations with the pipeline. One is to construct messages in the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
534
+
535
+ ```python
536
+ from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
537
+ from lmdeploy.vl import load_image
538
+
539
+ model = 'OpenGVLab/InternVL2-1B'
540
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
541
+
542
+ image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
543
+ gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
544
+ sess = pipe.chat(('describe this image', image), gen_config=gen_config)
545
+ print(sess.response.text)
546
+ sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
547
+ print(sess.response.text)
548
+ ```
549
+
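For the first approach mentioned above, here is a hedged sketch that keeps the conversation as an OpenAI-style message list and re-sends it each turn (this assumes the pipeline accepts GPT-4V style messages, the same format used by the service example below):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline('OpenGVLab/InternVL2-1B', backend_config=TurbomindEngineConfig(session_len=8192))

# First turn: a user message carrying both text and an image URL.
messages = [dict(role='user', content=[
    dict(type='text', text='describe this image'),
    dict(type='image_url', image_url=dict(url='https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')),
])]
response = pipe(messages)
print(response.text)

# Second turn: append the assistant reply and the follow-up question, then resend the history.
messages.append(dict(role='assistant', content=response.text))
messages.append(dict(role='user', content='What is the woman doing?'))
response = pipe(messages)
print(response.text)
```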
550
+ #### Service
551
 
552
+ LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of starting the service:
553
 
554
+ ```shell
555
+ lmdeploy serve api_server OpenGVLab/InternVL2-1B --backend turbomind --server-port 23333
556
+ ```
557
+
558
+ To use the OpenAI-style interface, you need to install the OpenAI Python package:
559
+
560
+ ```shell
561
+ pip install openai
562
+ ```
563
+
564
+ Then, use the code below to make the API call:
565
+
566
+ ```python
567
+ from openai import OpenAI
568
+
569
+ client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
570
+ model_name = client.models.list().data[0].id
571
+ response = client.chat.completions.create(
572
+ model=model_name,
573
+ messages=[{
574
+ 'role':
575
+ 'user',
576
+ 'content': [{
577
+ 'type': 'text',
578
+ 'text': 'describe this image',
579
+ }, {
580
+ 'type': 'image_url',
581
+ 'image_url': {
582
+ 'url':
583
+ 'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
584
+ },
585
+ }],
586
+ }],
587
+ temperature=0.8,
588
+ top_p=0.8)
589
+ print(response)
590
+ ```
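The same endpoint also supports streaming through the standard OpenAI `stream` parameter; a sketch of a streamed text-only request (assuming the server started above is still running):

```python
from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id

# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model=model_name,
    messages=[{'role': 'user', 'content': 'Hello, who are you?'}],
    temperature=0.8,
    top_p=0.8,
    stream=True)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end='', flush=True)
print()
```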
591
 
592
  ## License
593
 
 
661
  | MathVista<sub>testmini</sub> | 28.7 | 41.1 | 46.3 | 37.7 |
662
  | OpenCompass<sub>avg</sub> | 46.6 | 49.8 | 54.0 | 48.3 |
663
 
664
+ - 关于更多的细节以及评测复现,请看我们的[评测指南](https://internvl.readthedocs.io/en/latest/internvl2.0/evaluation.html)。
665
+
666
  - 我们同时使用 InternVL 和 VLMEvalKit 仓库进行模型评估。具体来说,DocVQA、ChartQA、InfoVQA、TextVQA、MME、AI2D、MMBench、CCBench、MMVet 和 SEED-Image 的结果是使用 InternVL 仓库测试的。OCRBench、RealWorldQA、HallBench 和 MathVista 是使用 VLMEvalKit 进行评估的。
667
 
668
  - 对于MMMU,我们报告了原始分数(左侧:InternVL系列模型使用InternVL代码库评测,其他模型的分数来自其技术报告或网页)和VLMEvalKit分数(右侧:从OpenCompass排行榜收集)。
 
721
 
722
  ## 微调
723
 
724
+ 许多仓库现在都支持 InternVL 系列模型的微调,包括 [InternVL](https://github.com/OpenGVLab/InternVL)、[SWIFT](https://github.com/modelscope/ms-swift)、[XTuner](https://github.com/InternLM/xtuner) 等。请参阅它们的文档以获取更多微调细节。
725
 
726
  ## 部署
727
 
728
  ### LMDeploy
729
 
730
+ LMDeploy 是由 MMRazor 和 MMDeploy 团队开发的用于压缩、部署和服务大语言模型(LLM)的工具包。
731
+
732
+ ```sh
733
+ pip install lmdeploy==0.5.3
734
+ ```
735
+
736
+ LMDeploy 将多模态视觉-语言模型(VLM)的复杂推理过程抽象为一个易于使用的管道,类似于大语言模型(LLM)的推理管道。
737
+
738
+ #### 一个“你好,世界”示例
739
+
740
+ ```python
741
+ from lmdeploy import pipeline, TurbomindEngineConfig
742
+ from lmdeploy.vl import load_image
743
+
744
+ model = 'OpenGVLab/InternVL2-1B'
745
+ image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
746
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
747
+ response = pipe(('describe this image', image))
748
+ print(response.text)
749
+ ```
750
+
751
+ 如果在执行此示例时出现 `ImportError`,请按照提示安装所需的依赖包。
752
+
753
+ #### 多图像推理
754
+
755
+ 在处理多张图像时,可以将它们全部放入一个列表中。请注意,多张图像会导致输入 token 数量增加,因此通常需要增加上下文窗口的大小。
756
+
757
+ ```python
758
+ from lmdeploy import pipeline, TurbomindEngineConfig
759
+ from lmdeploy.vl import load_image
760
+ from lmdeploy.vl.constants import IMAGE_TOKEN
761
+
762
+ model = 'OpenGVLab/InternVL2-1B'
763
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
764
+
765
+ image_urls=[
766
+ 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
767
+ 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
768
+ ]
769
+
770
+ images = [load_image(img_url) for img_url in image_urls]
771
+ # Numbering images improves multi-image conversations
772
+ response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
773
+ print(response.text)
774
+ ```
775
 
776
+ #### 批量Prompt推理
777
 
778
+ 使用批量Prompt进行推理非常简单;只需将它们放在一个列表结构中:
779
 
780
+ ```python
781
+ from lmdeploy import pipeline, TurbomindEngineConfig
782
+ from lmdeploy.vl import load_image
783
+
784
+ model = 'OpenGVLab/InternVL2-1B'
785
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
786
+
787
+ image_urls=[
788
+ "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
789
+ "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
790
+ ]
791
+ prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
792
+ response = pipe(prompts)
793
+ print(response)
794
+ ```
795
 
796
+ #### 多轮对话
797
+
798
+ 使用管道进行多轮对话有两种方法。一种是根据 OpenAI 的格式构建消息并使用上述方法,另一种是使用 `pipeline.chat` 接口。
799
+
800
+ ```python
801
+ from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
802
+ from lmdeploy.vl import load_image
803
+
804
+ model = 'OpenGVLab/InternVL2-1B'
805
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
806
+
807
+ image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
808
+ gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
809
+ sess = pipe.chat(('describe this image', image), gen_config=gen_config)
810
+ print(sess.response.text)
811
+ sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
812
+ print(sess.response.text)
813
+ ```
814
+
815
+ #### API部署
816
+
817
+ LMDeploy 的 `api_server` 使模型能够通过一个命令轻松打包成服务。提供的 RESTful API 与 OpenAI 的接口兼容。以下是服务启动的示例:
818
+
819
+ ```shell
820
+ lmdeploy serve api_server OpenGVLab/InternVL2-1B --backend turbomind --server-port 23333
821
+ ```
822
+
823
+ 为了使用OpenAI风格的API接口,您需要安装OpenAI:
824
+
825
+ ```shell
826
+ pip install openai
827
+ ```
828
+
829
+ 然后,使用下面的代码进行API调用:
830
+
831
+ ```python
832
+ from openai import OpenAI
833
+
834
+ client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
835
+ model_name = client.models.list().data[0].id
836
+ response = client.chat.completions.create(
837
+ model=model_name,
838
+ messages=[{
839
+ 'role':
840
+ 'user',
841
+ 'content': [{
842
+ 'type': 'text',
843
+ 'text': 'describe this image',
844
+ }, {
845
+ 'type': 'image_url',
846
+ 'image_url': {
847
+ 'url':
848
+ 'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
849
+ },
850
+ }],
851
+ }],
852
+ temperature=0.8,
853
+ top_p=0.8)
854
+ print(response)
855
+ ```
856
 
857
  ## 开源许可证
858
 
modeling_intern_vit.py CHANGED
@@ -15,24 +15,17 @@ from transformers.activations import ACT2FN
15
  from transformers.modeling_outputs import (BaseModelOutput,
16
  BaseModelOutputWithPooling)
17
  from transformers.modeling_utils import PreTrainedModel
18
- from transformers.utils.import_utils import is_flash_attn_greater_or_equal
19
  from transformers.utils import logging
20
 
21
  from .configuration_intern_vit import InternVisionConfig
22
 
23
  try:
24
- if is_flash_attn_greater_or_equal("2.0.0"):
25
- from flash_attn.flash_attn_interface import \
26
- flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func
27
- else:
28
- from flash_attn.flash_attn_interface import \
29
- flash_attn_unpadded_qkvpacked_func
30
-
31
  from flash_attn.bert_padding import pad_input, unpad_input
32
-
 
33
  has_flash_attn = True
34
  except:
35
- print('FlashAttention is not installed.')
36
  has_flash_attn = False
37
 
38
  logger = logging.get_logger(__name__)
@@ -75,7 +68,7 @@ class FlashAttention(nn.Module):
75
  max_s = seqlen
76
  cu_seqlens = torch.arange(0, (batch_size + 1) * seqlen, step=seqlen, dtype=torch.int32,
77
  device=qkv.device)
78
- output = flash_attn_unpadded_qkvpacked_func(
79
  qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
80
  softmax_scale=self.softmax_scale, causal=causal
81
  )
@@ -85,7 +78,7 @@ class FlashAttention(nn.Module):
85
  x = rearrange(qkv, 'b s three h d -> b s (three h d)')
86
  x_unpad, indices, cu_seqlens, max_s = unpad_input(x, key_padding_mask)
87
  x_unpad = rearrange(x_unpad, 'nnz (three h d) -> nnz three h d', three=3, h=nheads)
88
- output_unpad = flash_attn_unpadded_qkvpacked_func(
89
  x_unpad, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
90
  softmax_scale=self.softmax_scale, causal=causal
91
  )
@@ -94,7 +87,7 @@ class FlashAttention(nn.Module):
94
  'b s (h d) -> b s h d', h=nheads)
95
  else:
96
  assert max_s is not None
97
- output = flash_attn_unpadded_qkvpacked_func(
98
  qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
99
  softmax_scale=self.softmax_scale, causal=causal
100
  )
 
15
  from transformers.modeling_outputs import (BaseModelOutput,
16
  BaseModelOutputWithPooling)
17
  from transformers.modeling_utils import PreTrainedModel
 
18
  from transformers.utils import logging
19
 
20
  from .configuration_intern_vit import InternVisionConfig
21
 
22
  try:
23
  from flash_attn.bert_padding import pad_input, unpad_input
24
+ from flash_attn.flash_attn_interface import \
25
+ flash_attn_varlen_qkvpacked_func
26
  has_flash_attn = True
27
  except:
28
+ print('FlashAttention2 is not installed.')
29
  has_flash_attn = False
30
 
31
  logger = logging.get_logger(__name__)
 
68
  max_s = seqlen
69
  cu_seqlens = torch.arange(0, (batch_size + 1) * seqlen, step=seqlen, dtype=torch.int32,
70
  device=qkv.device)
71
+ output = flash_attn_varlen_qkvpacked_func(
72
  qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
73
  softmax_scale=self.softmax_scale, causal=causal
74
  )
 
78
  x = rearrange(qkv, 'b s three h d -> b s (three h d)')
79
  x_unpad, indices, cu_seqlens, max_s = unpad_input(x, key_padding_mask)
80
  x_unpad = rearrange(x_unpad, 'nnz (three h d) -> nnz three h d', three=3, h=nheads)
81
+ output_unpad = flash_attn_varlen_qkvpacked_func(
82
  x_unpad, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
83
  softmax_scale=self.softmax_scale, causal=causal
84
  )
 
87
  'b s (h d) -> b s h d', h=nheads)
88
  else:
89
  assert max_s is not None
90
+ output = flash_attn_varlen_qkvpacked_func(
91
  qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
92
  softmax_scale=self.softmax_scale, causal=causal
93
  )