# Image Analysis with InternVL2
This project uses the InternVL2-40B-AWQ model for high-quality image analysis, description, and understanding. It provides a Gradio web interface where users can upload images and receive detailed analysis.
## Features
- High-Quality Image Analysis: Uses InternVL2-40B (4-bit quantized) for state-of-the-art image understanding
- Multiple Analysis Types: General description, text extraction, chart analysis, people description, and technical analysis
- Simple UI: User-friendly Gradio interface for easy image uploading and analysis
- Efficient Resource Usage: 4-bit quantized model (AWQ) for reduced memory footprint and faster inference
## Requirements
The application requires:
- Python 3.9+
- CUDA-compatible GPU (recommended 24GB+ VRAM)
- Transformers 4.37.2+
- lmdeploy 0.5.3+
- Gradio 3.38.0
- Other dependencies listed in `requirements.txt`
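The pinned versions above can be captured in a `requirements.txt` along these lines (a sketch based on the list in this section; the project's actual file may pin additional packages):

```
# requirements.txt (sketch; versions taken from the Requirements list above)
transformers>=4.37.2
lmdeploy>=0.5.3
gradio==3.38.0
```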
## Setup

### Docker Setup (Recommended)

Build the Docker image:

```bash
docker build -t internvl2-image-analysis .
```

Run the Docker container:

```bash
docker run --gpus all -p 7860:7860 internvl2-image-analysis
```
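For reference, a Dockerfile along these lines would support the build command above. This is a minimal sketch: the CUDA base image, tag, and system packages are assumptions, and the project's actual Dockerfile may differ.

```dockerfile
# Sketch only: base image and tag are assumptions, not the project's actual choices
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY . .

# Gradio serves on port 7860 by default
EXPOSE 7860
CMD ["python3", "app_internvl2.py"]
```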
### Local Setup

Create a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Run the application:

```bash
python app_internvl2.py
```
## Usage

- Open your browser and navigate to http://localhost:7860
- Upload an image using the upload box
- Choose an analysis type from the options
- Click "Analyze Image" and wait for the results
## Analysis Types
- General: Provides a comprehensive description of the image content
- Text: Focuses on identifying and extracting text from the image
- Chart: Analyzes charts, graphs, and diagrams in detail
- People: Describes people in the image, including appearance, actions, and expressions
- Technical: Provides technical analysis of objects and their relationships
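Each analysis type corresponds to a different prompt sent to the model. A minimal sketch of how such a mapping might look (the prompt strings and function name here are illustrative, not the actual code in `app_internvl2.py`):

```python
# Hypothetical mapping of analysis types to model prompts;
# the real prompts used by the app may differ.
ANALYSIS_PROMPTS = {
    "General": "Describe this image in detail.",
    "Text": "Identify and extract all text visible in this image.",
    "Chart": "Analyze any charts, graphs, or diagrams in this image in detail.",
    "People": "Describe the people in this image: appearance, actions, and expressions.",
    "Technical": "Give a technical analysis of the objects in this image and their relationships.",
}

def build_prompt(analysis_type: str) -> str:
    """Return the prompt for an analysis type, falling back to General."""
    return ANALYSIS_PROMPTS.get(analysis_type, ANALYSIS_PROMPTS["General"])
```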
## Testing

To test the model directly from the command line:

```bash
python test_internvl2.py --image path/to/your/image.jpg --prompt "Describe this image in detail."
```
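The script's command-line interface can be sketched with `argparse` (a sketch matching only the two options shown above; the real `test_internvl2.py` may accept more):

```python
import argparse

def parse_args(argv=None):
    """Parse the --image / --prompt options used in the command above."""
    parser = argparse.ArgumentParser(
        description="Run a single InternVL2 query on one image."
    )
    parser.add_argument("--image", required=True,
                        help="Path to the input image")
    parser.add_argument("--prompt", default="Describe this image in detail.",
                        help="Prompt sent to the model along with the image")
    return parser.parse_args(argv)
```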
## Deployment to Hugging Face

To deploy to Hugging Face Spaces:

```bash
python upload_internvl2_to_hf.py
```
## Model Details
This application uses InternVL2-40B-AWQ, a 4-bit quantized version of InternVL2-40B. The original model consists of:
- Vision Component: InternViT-6B-448px-V1-5
- Language Component: Nous-Hermes-2-Yi-34B
- Total Parameters: ~40B (6B vision + 34B language)
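The VRAM recommendation above follows from the quantization. A back-of-envelope estimate of weight memory (weights only; the KV cache and activations add more on top):

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# ~40B parameters: 16-bit weights vs. 4-bit AWQ weights
fp16 = weight_memory_gb(40, 16)  # 80.0 GB
awq4 = weight_memory_gb(40, 4)   # 20.0 GB
```

At 4 bits per parameter the weights alone are roughly 20 GB, which is why a 24 GB+ GPU is recommended, while the unquantized 16-bit model would need around 80 GB.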
## License

This project is released under the same license as the InternVL2 model, the MIT License.
## Acknowledgements
- OpenGVLab for creating the InternVL2 models
- Hugging Face for model hosting
- lmdeploy for model optimization
- Gradio for the web interface