title: Image Description with Qwen-VL
emoji: 🖼️
colorFrom: indigo
colorTo: purple
sdk: docker
sdk_version: 3.0.0
app_file: app.py
pinned: false
Image Description Application with Qwen-VL
This application uses the Qwen-VL-Chat vision-language model to generate detailed descriptions of images. It is set up to describe the image in the data_temp folder, but can also analyze any uploaded image.
Features
- Loads an image from the data_temp folder or via upload
- Generates multiple types of descriptions using state-of-the-art AI:
  - Basic description (brief overview)
  - Detailed analysis (comprehensive description)
  - Technical analysis (assessment of technical aspects)
- Displays the image (optional)
- Uses 8-bit quantization for efficient model loading
- Provides a user-friendly Gradio UI
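The first feature, choosing between an uploaded image and the one in data_temp, could be handled by a small helper like the one below. This is a minimal sketch, not the code in app.py: the function name `pick_image` and the set of accepted extensions are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional

# Extensions treated as images; this set is an assumption, not taken from app.py.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".webp"}


def pick_image(uploaded: Optional[str] = None, folder: str = "data_temp") -> Optional[Path]:
    """Return the uploaded image if given, else the first image in `folder`."""
    if uploaded:
        return Path(uploaded)
    root = Path(folder)
    if not root.is_dir():
        return None
    # Sort by name so the choice is deterministic when several images exist.
    candidates = sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
    return candidates[0] if candidates else None
```

An upload always takes precedence here; the folder is only scanned when no upload is supplied.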
Requirements
- Python 3.8 or higher
- PyTorch
- Transformers (version 4.35.2+)
- Pillow
- Matplotlib
- Accelerate
- Bitsandbytes
- Safetensors
- Gradio for the web interface
Hardware Requirements
The vision-language model used by this application requires:
- A CUDA-capable GPU with at least 8GB VRAM
- 8GB+ system RAM
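A quick pre-flight check against the VRAM requirement above might look like the following sketch. The helper names are hypothetical, and the 8 GiB default simply mirrors the figure stated in this section. PyTorch is imported lazily so the comparison helper can be used without it.

```python
def has_enough_vram(total_bytes: int, min_gb: float = 8.0) -> bool:
    """True if `total_bytes` of GPU memory meets the minimum, measured in GiB."""
    return total_bytes >= min_gb * 1024 ** 3


def check_gpu(min_gb: float = 8.0) -> str:
    """Report whether the first CUDA device meets the VRAM minimum."""
    import torch  # imported lazily so the helper above works without PyTorch

    if not torch.cuda.is_available():
        return "No CUDA-capable GPU detected."
    props = torch.cuda.get_device_properties(0)
    gib = props.total_memory / 1024 ** 3
    verdict = "OK" if has_enough_vram(props.total_memory, min_gb) else "below minimum"
    return f"{props.name}: {gib:.1f} GiB VRAM ({verdict})"
```

Drivers often report slightly less than a card's nominal capacity, so a small margin below `min_gb` may be appropriate in practice.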
Deployment Options
1. Hugging Face Spaces (Recommended)
This repository is ready to be deployed on Hugging Face Spaces.
Steps:
- Create a new Space on Hugging Face Spaces
- Select "Docker" as the Space SDK
- Link this GitHub repository
- Select a GPU (T4 or better is recommended)
- Create the Space
The application will automatically deploy with the Gradio UI frontend.
2. AWS SageMaker
For production deployment on AWS SageMaker:
- Package the application using the provided Dockerfile
- Upload the Docker image to Amazon ECR
- Create a SageMaker Model using the ECR image
- Deploy an endpoint with an instance type like ml.g4dn.xlarge
- Set up API Gateway for HTTP access (optional)
Detailed AWS instructions can be found in the docs/aws_deployment.md file.
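As a rough illustration of the model and endpoint steps, the sketch below builds the arguments for the boto3 SageMaker calls `create_model`, `create_endpoint_config`, and `create_endpoint`. The function names, the `-config` and `-endpoint` naming scheme, and the single-instance variant are assumptions. The ECR image URI and IAM role ARN must be supplied by the caller. docs/aws_deployment.md remains the reference for the actual procedure.

```python
from typing import Dict


def build_sagemaker_requests(name: str, image_uri: str, role_arn: str,
                             instance_type: str = "ml.g4dn.xlarge") -> Dict[str, dict]:
    """Build keyword arguments for the three SageMaker API calls."""
    return {
        "model": {
            "ModelName": name,
            "PrimaryContainer": {"Image": image_uri},  # image pushed to ECR
            "ExecutionRoleArn": role_arn,
        },
        "endpoint_config": {
            "EndpointConfigName": f"{name}-config",
            "ProductionVariants": [{
                "VariantName": "AllTraffic",
                "ModelName": name,
                "InitialInstanceCount": 1,
                "InstanceType": instance_type,
            }],
        },
        "endpoint": {
            "EndpointName": f"{name}-endpoint",
            "EndpointConfigName": f"{name}-config",
        },
    }


def deploy(requests: Dict[str, dict]) -> None:
    """Create the model, endpoint config, and endpoint (needs AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    sm = boto3.client("sagemaker")
    sm.create_model(**requests["model"])
    sm.create_endpoint_config(**requests["endpoint_config"])
    sm.create_endpoint(**requests["endpoint"])
```

SageMaker real-time endpoints expect the container to answer /ping and /invocations on port 8080, so the image may need an inference server in addition to the Gradio UI.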
3. Azure Machine Learning
For Azure deployment:
- Create an Azure ML workspace
- Register the model on Azure ML
- Create an inference configuration
- Deploy to AKS or ACI with a GPU-enabled instance
Detailed Azure instructions can be found in the docs/azure_deployment.md file.
How It Works
The application uses Qwen-VL-Chat, a multimodal vision-language model that takes an image together with a text prompt and generates a text response.
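Loading the model with 8-bit weights through Transformers and bitsandbytes might look like the sketch below. The Hub ID `Qwen/Qwen-VL-Chat` and the helper names are assumptions about what app.py does. `trust_remote_code=True` is needed because Qwen-VL ships its model code with the checkpoint.

```python
MODEL_ID = "Qwen/Qwen-VL-Chat"  # Hugging Face Hub ID; assumed to be what app.py loads


def quantization_settings() -> dict:
    """Keyword arguments for transformers.BitsAndBytesConfig (8-bit weights)."""
    return {"load_in_8bit": True}


def load_model(model_id: str = MODEL_ID):
    """Load tokenizer and 8-bit model; needs a CUDA GPU and bitsandbytes."""
    # Heavy imports stay inside the function so importing this module is cheap.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(**quantization_settings()),
        device_map="auto",        # let Accelerate place layers on available devices
        trust_remote_code=True,   # run the model code bundled with the checkpoint
    )
    return tokenizer, model.eval()
```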
The script:
- Processes the image with three different prompts:
  - "Describe this image briefly in a single paragraph."
  - "Analyze this image in detail. Describe the main elements, any text visible, the colors, and the overall composition."
  - "What can you tell me about the technical aspects of this image?"
- Uses 8-bit quantization to reduce memory requirements
- Formats and displays the results
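The prompt loop can be sketched as follows. The prompt strings are the ones listed above. The names `describe_image` and `make_qwen_ask` are hypothetical, and the wiring to the model assumes the `from_list_format` and `chat` methods provided by Qwen-VL's remote code. Passing the model call in as a callable keeps the loop testable without a GPU.

```python
from typing import Callable, Dict

# The three prompts listed above, keyed by the kind of description they produce.
PROMPTS: Dict[str, str] = {
    "basic": "Describe this image briefly in a single paragraph.",
    "detailed": (
        "Analyze this image in detail. Describe the main elements, any text "
        "visible, the colors, and the overall composition."
    ),
    "technical": "What can you tell me about the technical aspects of this image?",
}


def describe_image(image_path: str, ask: Callable[[str, str], str]) -> Dict[str, str]:
    """Run every prompt against one image; `ask(image_path, prompt)` returns text."""
    return {kind: ask(image_path, prompt) for kind, prompt in PROMPTS.items()}


def make_qwen_ask(model, tokenizer) -> Callable[[str, str], str]:
    """Wrap a loaded Qwen-VL-Chat model and tokenizer as an `ask` callable."""
    def ask(image_path: str, prompt: str) -> str:
        # Build a mixed image/text query, then run one chat turn with no history
        # so each prompt is answered independently of the others.
        query = tokenizer.from_list_format([{"image": image_path}, {"text": prompt}])
        response, _history = model.chat(tokenizer, query=query, history=None)
        return response
    return ask
```

With a loaded model, `describe_image("data_temp/page_2.png", make_qwen_ask(model, tokenizer))` would return a dictionary with one entry per description type.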
Repository Structure
- app.py - Gradio UI for web interface
- Dockerfile - For containerized deployment
- requirements.txt - Python dependencies
- data_temp/ - Sample images for testing
Local Development
Install the required packages:
pip install -r requirements.txt
Run the Gradio UI:
python app.py
Visit http://localhost:7860 in your browser.
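For orientation, a Gradio app that serves on port 7860 could be structured like the sketch below. It is not the contents of app.py: `describe` is a placeholder standing in for the model call, and the component labels are assumptions borrowed from the example output headings.

```python
from typing import Tuple


def describe(image_path: str) -> Tuple[str, str, str]:
    """Placeholder for the model call: returns basic, detailed, technical text."""
    if not image_path:
        return ("No image provided.",) * 3
    return tuple(f"[{kind} description of {image_path}]"
                 for kind in ("basic", "detailed", "technical"))


def build_ui():
    """Build a Gradio interface with one image input and three text outputs."""
    import gradio as gr  # imported lazily; only needed when the UI is built

    return gr.Interface(
        fn=describe,
        inputs=gr.Image(type="filepath", label="Image"),
        outputs=[
            gr.Textbox(label="Basic Description"),
            gr.Textbox(label="Detailed Description"),
            gr.Textbox(label="Technical Analysis"),
        ],
        title="Image Description with Qwen-VL",
    )


# To serve on the port used above (0.0.0.0 makes it reachable inside Docker):
# build_ui().launch(server_name="0.0.0.0", server_port=7860)
```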
Example Output
Processing image: data_temp/page_2.png
Loading model...
Generating descriptions...
==== Image Description Results (Qwen-VL) ====
Basic Description:
The image shows a webpage or document with text content organized in multiple columns.
Detailed Description:
The image displays a structured document or webpage with multiple sections of text organized in a grid layout. The content appears to be technical or educational in nature, with what looks like headings and paragraphs of text. The color scheme is primarily black text on a white background, creating a clean, professional appearance. There appear to be multiple columns of information, possibly representing different topics or categories. The layout suggests this might be documentation, a reference guide, or an educational resource related to technical content.
Technical Analysis:
This appears to be a screenshot of a digital document or webpage. The image quality is good with clear text rendering, suggesting it was captured at an appropriate resolution. The image uses a standard document layout with what appears to be a grid or multi-column structure. The screenshot has been taken of what seems to be a text-heavy interface with minimal graphics, consistent with technical documentation or reference materials.
Note: Actual descriptions will vary based on the specific image content and may be more detailed than this example.
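The layout of the example output could be produced by a small formatter such as this sketch. The function name, the dictionary keys, and the fallback text for a missing section are assumptions. The banner and section headings are copied from the example.

```python
from typing import Dict

# Result keys paired with the headings shown in the example output.
SECTIONS = [
    ("basic", "Basic Description"),
    ("detailed", "Detailed Description"),
    ("technical", "Technical Analysis"),
]


def format_results(results: Dict[str, str], model_name: str = "Qwen-VL") -> str:
    """Render the description results as a banner followed by titled sections."""
    lines = [f"==== Image Description Results ({model_name}) ===="]
    for key, heading in SECTIONS:
        lines.append(f"{heading}:")
        lines.append(results.get(key, "(no output)"))
    return "\n".join(lines)
```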