# Image Analysis with InternVL2
This project uses the InternVL2-40B-AWQ model for high-quality image analysis, description, and understanding. It provides a Gradio web interface where users can upload images and receive detailed analyses.
## Features
- **High-Quality Image Analysis**: Uses InternVL2-40B (4-bit quantized) for state-of-the-art image understanding
- **Multiple Analysis Types**: General description, text extraction, chart analysis, people description, and technical analysis
- **Simple UI**: User-friendly Gradio interface for easy image uploading and analysis
- **Efficient Resource Usage**: 4-bit quantized model (AWQ) for reduced memory footprint and faster inference
## Requirements
The application requires:
- Python 3.9+
- CUDA-compatible GPU (24 GB+ VRAM recommended)
- Transformers 4.37.2+
- lmdeploy 0.5.3+
- Gradio 3.38.0
- Other dependencies in `requirements.txt`
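To verify an environment against these minimums programmatically, a small check like the following can help. This is a sketch, not part of the project; the package names and version floors are taken from the list above, and `check_requirements` is a hypothetical helper:

```python
# Sketch: verify installed packages meet the minimum versions listed above.
from importlib.metadata import version, PackageNotFoundError

MINIMUMS = {
    "transformers": (4, 37, 2),
    "lmdeploy": (0, 5, 3),
    "gradio": (3, 38, 0),
}

def parse_version(text: str) -> tuple:
    """Convert '4.37.2' into (4, 37, 2), dropping suffixes like 'rc1'."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def check_requirements(minimums=MINIMUMS) -> list:
    """Return a list of human-readable problems; empty means all satisfied."""
    problems = []
    for package, floor in minimums.items():
        try:
            installed = parse_version(version(package))
        except PackageNotFoundError:
            problems.append(f"{package} is not installed")
            continue
        if installed < floor:
            problems.append(f"{package} {installed} < required {floor}")
    return problems
```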
## Setup
### Docker Setup (Recommended)
1. **Build the Docker image**:
```
docker build -t internvl2-image-analysis .
```
2. **Run the Docker container**:
```
docker run --gpus all -p 7860:7860 internvl2-image-analysis
```
### Local Setup
1. **Create a virtual environment**:
```
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
2. **Install dependencies**:
```
pip install -r requirements.txt
```
3. **Run the application**:
```
python app_internvl2.py
```
## Usage
1. Open your browser and navigate to `http://localhost:7860`
2. Upload an image using the upload box
3. Choose an analysis type from the options
4. Click "Analyze Image" and wait for the results
### Analysis Types
- **General**: Provides a comprehensive description of the image content
- **Text**: Focuses on identifying and extracting text from the image
- **Chart**: Analyzes charts, graphs, and diagrams in detail
- **People**: Describes people in the image: appearance, actions, and expressions
- **Technical**: Provides technical analysis of objects and their relationships
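Each analysis type typically maps to a different prompt sent to the model. The actual prompt text lives in `app_internvl2.py`; the wording below is illustrative only, and `build_prompt` is a hypothetical helper showing the pattern:

```python
# Sketch: map each UI analysis type to a model prompt.
# The prompt wording here is illustrative, not the app's actual text.
ANALYSIS_PROMPTS = {
    "General": "Describe this image in detail.",
    "Text": "Extract and transcribe all text visible in this image.",
    "Chart": "Analyze the charts, graphs, or diagrams in this image.",
    "People": "Describe the people in this image: appearance, actions, expressions.",
    "Technical": "Give a technical analysis of the objects and their relationships.",
}

def build_prompt(analysis_type: str) -> str:
    """Return the prompt for a UI selection, falling back to General."""
    return ANALYSIS_PROMPTS.get(analysis_type, ANALYSIS_PROMPTS["General"])
```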
## Testing
To test the model directly from the command line:
```
python test_internvl2.py --image path/to/your/image.jpg --prompt "Describe this image in detail."
```
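The test script's command-line interface presumably parses the two flags shown above. A minimal `argparse` sketch consistent with that invocation (the default prompt value is an assumption, not the script's actual default):

```python
# Sketch: argument parsing consistent with the test command shown above.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Query InternVL2 with one image")
    parser.add_argument("--image", required=True, help="Path to the input image")
    # Assumed default; the real script may require --prompt or use other text.
    parser.add_argument("--prompt", default="Describe this image in detail.",
                        help="Question or instruction for the model")
    return parser
```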
## Deployment to Hugging Face
To deploy to Hugging Face Spaces:
```
python upload_internvl2_to_hf.py
```
## Model Details
This application uses InternVL2-40B-AWQ, a 4-bit quantized version of InternVL2-40B. The original model consists of:
- **Vision Component**: InternViT-6B-448px-V1-5
- **Language Component**: Nous-Hermes-2-Yi-34B
- **Total Parameters**: ~40B (6B vision + 34B language)
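The 24 GB+ VRAM recommendation is consistent with the quantization: at 4 bits per weight, ~40B parameters alone occupy roughly 20 GB, before KV cache, activations, and quantization metadata. A back-of-the-envelope estimate:

```python
# Back-of-the-envelope weight-memory estimate for an AWQ (4-bit) model.
# Ignores KV cache, activations, and quantization scale/zero-point overhead.
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB

# ~40B parameters at 4 bits -> ~20 GB; the same model at 16 bits -> ~80 GB.
```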
## License
This project is released under the same license as the InternVL2 model, the MIT license.
## Acknowledgements
- [OpenGVLab](https://github.com/OpenGVLab) for creating the InternVL2 models
- [Hugging Face](https://huggingface.co/) for model hosting
- [lmdeploy](https://github.com/InternLM/lmdeploy) for model optimization
- [Gradio](https://gradio.app/) for the web interface