# Image Analysis with InternVL2
This project uses the InternVL2-40B-AWQ model for high-quality image analysis, description, and understanding. It provides a Gradio web interface where users can upload images and receive detailed analysis.
## Features
- High-Quality Image Analysis: Uses InternVL2-40B (4-bit quantized) for state-of-the-art image understanding
- Multiple Analysis Types: General description, text extraction, chart analysis, people description, and technical analysis
- Simple UI: User-friendly Gradio interface for easy image uploading and analysis
- Efficient Resource Usage: 4-bit quantized model (AWQ) for reduced memory footprint and faster inference
## Requirements
The application requires:
- Python 3.9+
- CUDA-compatible GPU (recommended 24GB+ VRAM)
- Transformers 4.37.2+
- lmdeploy 0.5.3+
- Gradio 3.38.0
- Other dependencies listed in `requirements.txt`
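The pinned versions above can be captured in a `requirements.txt` along these lines (a sketch based on the list in this section; the project's actual file may pin additional packages):

```
# requirements.txt (sketch; versions taken from the Requirements list above)
transformers>=4.37.2
lmdeploy>=0.5.3
gradio==3.38.0
```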
## Setup

### Docker Setup (Recommended)

Build the Docker image:

```bash
docker build -t internvl2-image-analysis .
```

Run the Docker container:

```bash
docker run --gpus all -p 7860:7860 internvl2-image-analysis
```
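For reference, a Dockerfile along these lines would support the build command above. This is a minimal sketch: the CUDA base image, tag, and system packages are assumptions, and the project's actual Dockerfile may differ.

```dockerfile
# Sketch only: base image and tag are assumptions, not the project's actual choices
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY . .

# Gradio serves on port 7860 by default
EXPOSE 7860
CMD ["python3", "app_internvl2.py"]
```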
### Local Setup

Create a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Run the application:

```bash
python app_internvl2.py
```
## Usage

- Open your browser and navigate to http://localhost:7860
- Upload an image using the upload box
- Choose an analysis type from the options
- Click "Analyze Image" and wait for the results
## Analysis Types
- General: Provides a comprehensive description of the image content
- Text: Focuses on identifying and extracting text from the image
- Chart: Analyzes charts, graphs, and diagrams in detail
- People: Describes people in the image, including appearance, actions, and expressions
- Technical: Provides technical analysis of objects and their relationships
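Each analysis type corresponds to a different prompt sent to the model. A minimal sketch of how such a mapping might look (the prompt strings and function name here are illustrative, not the actual code in `app_internvl2.py`):

```python
# Hypothetical mapping of analysis types to model prompts;
# the real prompts used by the app may differ.
ANALYSIS_PROMPTS = {
    "General": "Describe this image in detail.",
    "Text": "Identify and extract all text visible in this image.",
    "Chart": "Analyze any charts, graphs, or diagrams in this image in detail.",
    "People": "Describe the people in this image: appearance, actions, and expressions.",
    "Technical": "Give a technical analysis of the objects in this image and their relationships.",
}

def build_prompt(analysis_type: str) -> str:
    """Return the prompt for an analysis type, falling back to General."""
    return ANALYSIS_PROMPTS.get(analysis_type, ANALYSIS_PROMPTS["General"])
```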
## Testing

To test the model directly from the command line:

```bash
python test_internvl2.py --image path/to/your/image.jpg --prompt "Describe this image in detail."
```
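The script's command-line interface can be sketched with `argparse` (a sketch matching only the two options shown above; the real `test_internvl2.py` may accept more):

```python
import argparse

def parse_args(argv=None):
    """Parse the --image / --prompt options used in the command above."""
    parser = argparse.ArgumentParser(
        description="Run a single InternVL2 query on one image."
    )
    parser.add_argument("--image", required=True,
                        help="Path to the input image")
    parser.add_argument("--prompt", default="Describe this image in detail.",
                        help="Prompt sent to the model along with the image")
    return parser.parse_args(argv)
```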
## Deployment to Hugging Face

To deploy to Hugging Face Spaces:

```bash
python upload_internvl2_to_hf.py
```
## Model Details
This application uses InternVL2-40B-AWQ, a 4-bit quantized version of InternVL2-40B. The original model consists of:
- Vision Component: InternViT-6B-448px-V1-5
- Language Component: Nous-Hermes-2-Yi-34B
- Total Parameters: ~40B (6B vision + 34B language)
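The VRAM recommendation above follows from the quantization. A back-of-envelope estimate of weight memory (weights only; the KV cache and activations add more on top):

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# ~40B parameters: 16-bit weights vs. 4-bit AWQ weights
fp16 = weight_memory_gb(40, 16)  # 80.0 GB
awq4 = weight_memory_gb(40, 4)   # 20.0 GB
```

At 4 bits per parameter the weights alone are roughly 20 GB, which is why a 24 GB+ GPU is recommended, while the unquantized 16-bit model would need around 80 GB.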
## License

This project is released under the same license as the InternVL2 model, the MIT License.
## Acknowledgements
- OpenGVLab for creating the InternVL2 models
- Hugging Face for model hosting
- lmdeploy for model optimization
- Gradio for the web interface