# Image Analysis with InternVL2

This project uses the InternVL2-40B-AWQ model for high-quality image analysis, description, and understanding. It provides a Gradio web interface for uploading images and receiving detailed analysis.

## Features

- **High-Quality Image Analysis**: Uses InternVL2-40B (4-bit quantized) for state-of-the-art image understanding
- **Multiple Analysis Types**: General description, text extraction, chart analysis, people description, and technical analysis
- **Simple UI**: User-friendly Gradio interface for easy image uploading and analysis
- **Efficient Resource Usage**: 4-bit AWQ quantization reduces the memory footprint and speeds up inference

## Requirements

The application requires:

- Python 3.9+
- CUDA-compatible GPU (24 GB+ VRAM recommended)
- Transformers 4.37.2+
- lmdeploy 0.5.3+
- Gradio 3.38.0
- Other dependencies listed in `requirements.txt`

## Setup

### Docker Setup (Recommended)

1. **Build the Docker image**:

   ```
   docker build -t internvl2-image-analysis .
   ```

2. **Run the Docker container**:

   ```
   docker run --gpus all -p 7860:7860 internvl2-image-analysis
   ```
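
For repeatable runs, the same container can also be described with Docker Compose. A minimal sketch, assuming the image name from the build step above (the service name is illustrative; the `deploy.resources` GPU syntax requires Docker Compose v2 with the NVIDIA container toolkit):

```yaml
# docker-compose.yml -- minimal sketch; service name is illustrative
services:
  internvl2:
    image: internvl2-image-analysis
    ports:
      - "7860:7860"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

With this file in place, `docker compose up` starts the app on port 7860.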

### Local Setup

1. **Create a virtual environment**:

   ```
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

2. **Install dependencies**:

   ```
   pip install -r requirements.txt
   ```

3. **Run the application**:

   ```
   python app_internvl2.py
   ```

## Usage

1. Open your browser and navigate to `http://localhost:7860`
2. Upload an image using the upload box
3. Choose an analysis type from the options
4. Click "Analyze Image" and wait for the results
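
The same analysis can be requested programmatically over Gradio's REST endpoint. The sketch below uses only the standard library; the payload layout (base64 image first, analysis type second) is an assumption that must match the input components actually defined in `app_internvl2.py`:

```python
# Query the running Gradio app's REST endpoint with the standard library.
# NOTE: the payload layout (image as a base64 data URL first, analysis
# type second) is an assumption -- check the inputs in app_internvl2.py.
import base64
import json
import urllib.request

def build_payload(image_path: str, analysis_type: str = "General") -> bytes:
    """Encode the image and analysis type as a Gradio /api/predict payload."""
    with open(image_path, "rb") as f:
        data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()
    return json.dumps({"data": [data_url, analysis_type]}).encode()

def analyze_image(image_path: str, analysis_type: str = "General") -> str:
    """Send an image to the local app and return the analysis text."""
    req = urllib.request.Request(
        "http://localhost:7860/api/predict",
        data=build_payload(image_path, analysis_type),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["data"][0]

if __name__ == "__main__":
    print(analyze_image("example.jpg", "Text"))
```

This requires the app to be running locally (step 3 of the setup above).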

### Analysis Types

- **General**: Provides a comprehensive description of the image content
- **Text**: Focuses on identifying and extracting text from the image
- **Chart**: Analyzes charts, graphs, and diagrams in detail
- **People**: Describes people in the image, including appearance, actions, and expressions
- **Technical**: Provides technical analysis of objects and their relationships
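
Internally, each analysis type typically maps to a different prompt sent to the model. A minimal sketch of such a mapping (the exact prompt wording used in `app_internvl2.py` may differ; these strings are illustrative):

```python
# Illustrative mapping from analysis type to model prompt.
# The actual prompts in app_internvl2.py may be worded differently.
ANALYSIS_PROMPTS = {
    "General": "Describe this image in detail.",
    "Text": "Extract and transcribe all text visible in this image.",
    "Chart": "Analyze the charts, graphs, or diagrams in this image in detail.",
    "People": "Describe the people in this image: appearance, actions, and expressions.",
    "Technical": "Provide a technical analysis of the objects in this image and their relationships.",
}

def get_prompt(analysis_type: str) -> str:
    """Return the prompt for an analysis type, falling back to General."""
    return ANALYSIS_PROMPTS.get(analysis_type, ANALYSIS_PROMPTS["General"])
```

Keeping the prompts in one dictionary makes it easy to add a new analysis type without touching the inference code.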

## Testing

To test the model directly from the command line:

```
python test_internvl2.py --image path/to/your/image.jpg --prompt "Describe this image in detail."
```
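
A test script like this typically drives the model through lmdeploy's vision-language pipeline. A minimal sketch under that assumption (requires a CUDA GPU with sufficient VRAM; the actual `test_internvl2.py` may differ):

```python
# Minimal sketch of a command-line test driver using lmdeploy's
# vision-language pipeline (requires a CUDA GPU with enough VRAM).
import argparse

MODEL_ID = "OpenGVLab/InternVL2-40B-AWQ"

def main() -> None:
    parser = argparse.ArgumentParser(description="Query InternVL2 on one image")
    parser.add_argument("--image", required=True, help="Path to the input image")
    parser.add_argument("--prompt", default="Describe this image in detail.")
    args = parser.parse_args()

    # Imported lazily so the script can be inspected without a GPU present.
    from lmdeploy import pipeline
    from lmdeploy.vl import load_image

    pipe = pipeline(MODEL_ID)
    response = pipe((args.prompt, load_image(args.image)))
    print(response.text)

if __name__ == "__main__":
    main()
```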

## Deployment to Hugging Face

To deploy to Hugging Face Spaces:

```
python upload_internvl2_to_hf.py
```

## Model Details

This application uses InternVL2-40B-AWQ, a 4-bit quantized version of InternVL2-40B. The original model consists of:

- **Vision Component**: InternViT-6B-448px-V1-5
- **Language Component**: Nous-Hermes-2-Yi-34B
- **Total Parameters**: ~40B (6B vision + 34B language)

## License

This project is released under the same license as the InternVL2 model, the MIT License.

## Acknowledgements

- [OpenGVLab](https://github.com/OpenGVLab) for creating the InternVL2 models
- [Hugging Face](https://huggingface.co/) for model hosting
- [lmdeploy](https://github.com/InternLM/lmdeploy) for model optimization
- [Gradio](https://gradio.app/) for the web interface