# Image Analysis with InternVL2

This project uses the InternVL2-40B-AWQ model for high-quality image analysis, description, and understanding. It provides a Gradio web interface for uploading images and receiving detailed analysis.

## Features

- **High-Quality Image Analysis**: Uses InternVL2-40B (4-bit quantized) for state-of-the-art image understanding
- **Multiple Analysis Types**: General description, text extraction, chart analysis, people description, and technical analysis
- **Simple UI**: User-friendly Gradio interface for easy image uploading and analysis
- **Efficient Resource Usage**: 4-bit AWQ quantization reduces the memory footprint and speeds up inference

## Requirements

The application requires:

- Python 3.9+
- CUDA-compatible GPU (24 GB+ VRAM recommended)
- Transformers 4.37.2+
- lmdeploy 0.5.3+
- Gradio 3.38.0
- Other dependencies listed in `requirements.txt`

## Setup

### Docker Setup (Recommended)

1. **Build the Docker image**:

   ```
   docker build -t internvl2-image-analysis .
   ```

2. **Run the Docker container**:

   ```
   docker run --gpus all -p 7860:7860 internvl2-image-analysis
   ```
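
For repeatable runs, the same container can also be described with Docker Compose. A minimal sketch, assuming the image name from the build step above (the service name is illustrative; the `deploy.resources` GPU syntax requires Docker Compose v2 with the NVIDIA container toolkit):

```yaml
# docker-compose.yml -- minimal sketch; service name is illustrative
services:
  internvl2:
    image: internvl2-image-analysis
    ports:
      - "7860:7860"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

With this file in place, `docker compose up` starts the app on port 7860.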

### Local Setup

1. **Create a virtual environment**:

   ```
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

2. **Install dependencies**:

   ```
   pip install -r requirements.txt
   ```

3. **Run the application**:

   ```
   python app_internvl2.py
   ```

## Usage

1. Open your browser and navigate to `http://localhost:7860`
2. Upload an image using the upload box
3. Choose an analysis type from the options
4. Click "Analyze Image" and wait for the results
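
The same analysis can be requested programmatically over Gradio's REST endpoint. The sketch below uses only the standard library; the payload layout (base64 image first, analysis type second) is an assumption that must match the input components actually defined in `app_internvl2.py`:

```python
# Query the running Gradio app's REST endpoint with the standard library.
# NOTE: the payload layout (image as a base64 data URL first, analysis
# type second) is an assumption -- check the inputs in app_internvl2.py.
import base64
import json
import urllib.request

def build_payload(image_path: str, analysis_type: str = "General") -> bytes:
    """Encode the image and analysis type as a Gradio /api/predict payload."""
    with open(image_path, "rb") as f:
        data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()
    return json.dumps({"data": [data_url, analysis_type]}).encode()

def analyze_image(image_path: str, analysis_type: str = "General") -> str:
    """Send an image to the local app and return the analysis text."""
    req = urllib.request.Request(
        "http://localhost:7860/api/predict",
        data=build_payload(image_path, analysis_type),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["data"][0]

if __name__ == "__main__":
    print(analyze_image("example.jpg", "Text"))
```

This requires the app to be running locally (step 3 of the setup above).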

### Analysis Types

- **General**: Provides a comprehensive description of the image content
- **Text**: Focuses on identifying and extracting text from the image
- **Chart**: Analyzes charts, graphs, and diagrams in detail
- **People**: Describes people in the image, including appearance, actions, and expressions
- **Technical**: Provides technical analysis of objects and their relationships
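
Internally, each analysis type typically maps to a different prompt sent to the model. A minimal sketch of such a mapping (the exact prompt wording used in `app_internvl2.py` may differ; these strings are illustrative):

```python
# Illustrative mapping from analysis type to model prompt.
# The actual prompts in app_internvl2.py may be worded differently.
ANALYSIS_PROMPTS = {
    "General": "Describe this image in detail.",
    "Text": "Extract and transcribe all text visible in this image.",
    "Chart": "Analyze the charts, graphs, or diagrams in this image in detail.",
    "People": "Describe the people in this image: appearance, actions, and expressions.",
    "Technical": "Provide a technical analysis of the objects in this image and their relationships.",
}

def get_prompt(analysis_type: str) -> str:
    """Return the prompt for an analysis type, falling back to General."""
    return ANALYSIS_PROMPTS.get(analysis_type, ANALYSIS_PROMPTS["General"])
```

Keeping the prompts in one dictionary makes it easy to add a new analysis type without touching the inference code.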

## Testing

To test the model directly from the command line:

```
python test_internvl2.py --image path/to/your/image.jpg --prompt "Describe this image in detail."
```
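
A test script like this typically drives the model through lmdeploy's vision-language pipeline. A minimal sketch under that assumption (requires a CUDA GPU with sufficient VRAM; the actual `test_internvl2.py` may differ):

```python
# Minimal sketch of a command-line test driver using lmdeploy's
# vision-language pipeline (requires a CUDA GPU with enough VRAM).
import argparse

MODEL_ID = "OpenGVLab/InternVL2-40B-AWQ"

def main() -> None:
    parser = argparse.ArgumentParser(description="Query InternVL2 on one image")
    parser.add_argument("--image", required=True, help="Path to the input image")
    parser.add_argument("--prompt", default="Describe this image in detail.")
    args = parser.parse_args()

    # Imported lazily so the script can be inspected without a GPU present.
    from lmdeploy import pipeline
    from lmdeploy.vl import load_image

    pipe = pipeline(MODEL_ID)
    response = pipe((args.prompt, load_image(args.image)))
    print(response.text)

if __name__ == "__main__":
    main()
```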

## Deployment to Hugging Face

To deploy to Hugging Face Spaces:

```
python upload_internvl2_to_hf.py
```

## Model Details

This application uses InternVL2-40B-AWQ, a 4-bit quantized version of InternVL2-40B. The original model consists of:

- **Vision Component**: InternViT-6B-448px-V1-5
- **Language Component**: Nous-Hermes-2-Yi-34B
- **Total Parameters**: ~40B (6B vision + 34B language)

## License

This project is released under the same license as the InternVL2 model, the MIT License.

## Acknowledgements

- [OpenGVLab](https://github.com/OpenGVLab) for creating the InternVL2 models
- [Hugging Face](https://huggingface.co/) for model hosting
- [lmdeploy](https://github.com/InternLM/lmdeploy) for model optimization
- [Gradio](https://gradio.app/) for the web interface