# Image Analysis with InternVL2

This project uses the InternVL2-40B-AWQ model for high-quality image analysis, description, and understanding. It provides a Gradio web interface where users can upload images and receive a detailed analysis.

## Features

- **High-Quality Image Analysis**: Uses InternVL2-40B (4-bit quantized) for state-of-the-art image understanding
- **Multiple Analysis Types**: General description, text extraction, chart analysis, people description, and technical analysis
- **Simple UI**: User-friendly Gradio interface for easy image uploading and analysis
- **Efficient Resource Usage**: 4-bit quantized model (AWQ) for reduced memory footprint and faster inference

## Requirements

The application requires:

- Python 3.9+
- CUDA-compatible GPU (24GB+ VRAM recommended)
- Transformers 4.37.2+
- lmdeploy 0.5.3+
- Gradio 3.38.0
- Other dependencies in `requirements.txt`
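
For orientation, a `requirements.txt` consistent with the pins above might look like the following sketch; the file shipped in the repository is authoritative and lists the full dependency set:

```
# Illustrative pins only -- see the repository's requirements.txt for the complete list.
transformers>=4.37.2
lmdeploy>=0.5.3
gradio==3.38.0
```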

## Setup

### Docker Setup (Recommended)

1. **Build the Docker image**:
   ```
   docker build -t internvl2-image-analysis .
   ```

2. **Run the Docker container**:
   ```
   docker run --gpus all -p 7860:7860 internvl2-image-analysis
   ```
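
The build step assumes a `Dockerfile` at the repository root. As a rough reference, a minimal sketch could look like the one below; the base image and install steps are assumptions, and the project's actual Dockerfile may differ:

```
# Illustrative only -- the repository's own Dockerfile is authoritative.
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY . .
EXPOSE 7860
CMD ["python3", "app_internvl2.py"]
```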

### Local Setup

1. **Create a virtual environment**:
   ```
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

2. **Install dependencies**:
   ```
   pip install -r requirements.txt
   ```

3. **Run the application**:
   ```
   python app_internvl2.py
   ```

## Usage

1. Open your browser and navigate to `http://localhost:7860`
2. Upload an image using the upload box
3. Choose an analysis type from the options
4. Click "Analyze Image" and wait for the results
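
Under the hood, these steps are wired together by a Gradio `Interface`. The outline below is a hypothetical sketch of that wiring (the function and component names are assumptions; see `app_internvl2.py` for the actual implementation):

```
# Hypothetical outline of the Gradio wiring in app_internvl2.py;
# the real component layout and prompt handling may differ.
import gradio as gr

def analyze(image, analysis_type):
    # The real app builds a type-specific prompt and runs the
    # InternVL2-40B-AWQ pipeline on the uploaded image.
    return f"({analysis_type} analysis of the uploaded image would appear here)"

demo = gr.Interface(
    fn=analyze,
    inputs=[
        gr.Image(type="pil", label="Image"),
        gr.Radio(["General", "Text", "Chart", "People", "Technical"], label="Analysis type"),
    ],
    outputs=gr.Textbox(label="Result"),
    title="Image Analysis with InternVL2",
)
demo.launch(server_name="0.0.0.0", server_port=7860)
```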

### Analysis Types

- **General**: Provides a comprehensive description of the image content
- **Text**: Focuses on identifying and extracting text from the image
- **Chart**: Analyzes charts, graphs, and diagrams in detail
- **People**: Describes people in the image - appearance, actions, and expressions
- **Technical**: Provides technical analysis of objects and their relationships
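
Each analysis type corresponds to a different prompt sent to the model. The mapping below is a hypothetical illustration; the actual prompt wording is defined in `app_internvl2.py`:

```
# Hypothetical prompts -- the real wording lives in app_internvl2.py.
ANALYSIS_PROMPTS = {
    "General": "Describe this image in detail.",
    "Text": "Identify and transcribe all text visible in this image.",
    "Chart": "Analyze any charts, graphs, or diagrams in this image, including axes, trends, and key values.",
    "People": "Describe the people in this image: their appearance, actions, and expressions.",
    "Technical": "Give a technical analysis of the objects in this image and the relationships between them.",
}

prompt = ANALYSIS_PROMPTS["General"]
```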

## Testing

To test the model directly from the command line:

```
python test_internvl2.py --image path/to/your/image.jpg --prompt "Describe this image in detail."
```

## Deployment to Hugging Face

To deploy to Hugging Face Spaces:

```
python upload_internvl2_to_hf.py
```

## Model Details

This application uses InternVL2-40B-AWQ, a 4-bit quantized version of InternVL2-40B. The original model consists of:

- **Vision Component**: InternViT-6B-448px-V1-5
- **Language Component**: Nous-Hermes-2-Yi-34B
- **Total Parameters**: ~40B (6B vision + 34B language)
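
For reference, this is roughly how the AWQ checkpoint can be served with lmdeploy's pipeline API. The model ID and engine settings shown here are assumptions; `app_internvl2.py` defines the values the app actually uses:

```
# Minimal sketch of loading the 4-bit AWQ checkpoint with lmdeploy.
# The session length and model ID are assumptions, not the app's exact settings.
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline(
    "OpenGVLab/InternVL2-40B-AWQ",
    backend_config=TurbomindEngineConfig(model_format="awq", session_len=8192),
)

image = load_image("path/to/your/image.jpg")
response = pipe(("Describe this image in detail.", image))
print(response.text)
```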

## License

This project is released under the same license as the InternVL2 model, the MIT license.

## Acknowledgements

- [OpenGVLab](https://github.com/OpenGVLab) for creating the InternVL2 models
- [Hugging Face](https://huggingface.co/) for model hosting
- [lmdeploy](https://github.com/InternLM/lmdeploy) for model optimization
- [Gradio](https://gradio.app/) for the web interface