zorba111 committed
Commit 36a599e · verified · 1 Parent(s): 2ad48f3

Upload folder using huggingface_hub

Files changed (7):
  1. Dockerfile +26 -0
  2. README.md +33 -10
  3. api.py +19 -98
  4. modal_app.py +144 -0
  5. requirements.txt +29 -16
  6. test-api.py +30 -0
  7. test_api.py +43 -0
Dockerfile ADDED
@@ -0,0 +1,26 @@
+ FROM python:3.12-slim
+
+ WORKDIR /app
+
+ # Copy requirements first to leverage Docker cache
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy the rest of the application
+ COPY . .
+
+ # Install system dependencies for PIL and torch
+ RUN apt-get update && apt-get install -y \
+     libgl1-mesa-glx \
+     libglib2.0-0 \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Set environment variables
+ ENV GRADIO_SERVER_NAME=0.0.0.0
+ ENV GRADIO_SERVER_PORT=7860
+
+ # Expose the port
+ EXPOSE 7860
+
+ # Run the application
+ CMD ["python", "gradio_demo.py"]
README.md CHANGED
@@ -4,6 +4,7 @@ app_file: gradio_demo.py
  sdk: gradio
  sdk_version: 5.4.0
  ---
+
  # OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent

  <p align="center">
@@ -13,50 +14,72 @@ sdk_version: 5.4.0
  [![arXiv](https://img.shields.io/badge/Paper-green)](https://arxiv.org/abs/2408.00203)
  [![License](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

- 📢 [[Project Page](https://microsoft.github.io/OmniParser/)] [[Blog Post](https://www.microsoft.com/en-us/research/articles/omniparser-for-pure-vision-based-gui-agent/)] [[Models](https://huggingface.co/microsoft/OmniParser)]
+ 📢 [[Project Page](https://microsoft.github.io/OmniParser/)] [[Blog Post](https://www.microsoft.com/en-us/research/articles/omniparser-for-pure-vision-based-gui-agent/)] [[Models](https://huggingface.co/microsoft/OmniParser)]

- **OmniParser** is a comprehensive method for parsing user interface screenshots into structured and easy-to-understand elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface.
+ **OmniParser** is a comprehensive method for parsing user interface screenshots into structured and easy-to-understand elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface.

  ## News
+
  - [2024/10] Both Interactive Region Detection Model and Icon functional description model are released! [Hugginface models](https://huggingface.co/microsoft/OmniParser)
- - [2024/09] OmniParser achieves the best performance on [Windows Agent Arena](https://microsoft.github.io/WindowsAgentArena/)!
+ - [2024/09] OmniParser achieves the best performance on [Windows Agent Arena](https://microsoft.github.io/WindowsAgentArena/)!
+
+ ## Install

- ## Install
  Install environment:
+
  ```python
  conda create -n "omni" python==3.12
  conda activate omni
  pip install -r requirements.txt
  ```

- Then download the model ckpts files in: https://huggingface.co/microsoft/OmniParser, and put them under weights/, default folder structure is: weights/icon_detect, weights/icon_caption_florence, weights/icon_caption_blip2.
+ Then download the model ckpts files in: https://huggingface.co/microsoft/OmniParser, and put them under weights/, default folder structure is: weights/icon_detect, weights/icon_caption_florence, weights/icon_caption_blip2.
+
+ Finally, convert the safetensor to .pt file.

- Finally, convert the safetensor to .pt file.
  ```python
  python weights/convert_safetensor_to_pt.py
  ```

  ## Examples:
- We put together a few simple examples in the demo.ipynb.
+
+ We put together a few simple examples in the demo.ipynb.

  ## Gradio Demo
+
  To run gradio demo, simply run:
+
  ```python
  python gradio_demo.py
  ```

-
  ## 📚 Citation
+
  Our technical report can be found [here](https://arxiv.org/abs/2408.00203).
  If you find our work useful, please consider citing our work:
+
  ```
  @misc{lu2024omniparserpurevisionbased,
- title={OmniParser for Pure Vision Based GUI Agent},
+ title={OmniParser for Pure Vision Based GUI Agent},
  author={Yadong Lu and Jianwei Yang and Yelong Shen and Ahmed Awadallah},
  year={2024},
  eprint={2408.00203},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
- url={https://arxiv.org/abs/2408.00203},
+ url={https://arxiv.org/abs/2408.00203},
  }
  ```
+
+ title: Ui Element Coordinates Finder
+ emoji: 🏢
+ colorFrom: pink
+ colorTo: red
+ sdk: gradio
+ sdk_version: 5.4.0
+ app_file: app.py
+ pinned: false
+ license: mit
+
+ ---
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
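Side note (illustrative, not part of this diff): the README's install section tells the reader to download the checkpoint files and place them under weights/, but gives no command. A minimal sketch using huggingface_hub, the library named in the commit message, assuming the model repo's folder layout matches the directories the README lists:

```python
# Illustrative only: pull the OmniParser checkpoints into weights/ so that
# weights/icon_detect, weights/icon_caption_florence and weights/icon_caption_blip2
# exist before running weights/convert_safetensor_to_pt.py.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="microsoft/OmniParser", local_dir="weights")
```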
api.py CHANGED
@@ -1,105 +1,26 @@
- from fastapi import FastAPI, UploadFile, File, HTTPException
- from pydantic import BaseModel
- from PIL import Image
- import io
- import torch
+ from fastapi import FastAPI, File, UploadFile, Request
  from slowapi import Limiter, _rate_limit_exceeded_handler
  from slowapi.util import get_remote_address
  from slowapi.errors import RateLimitExceeded
+ from fastapi.responses import JSONResponse

- # Import your existing utilities and models
- from utils import check_ocr_box, get_yolo_model, get_caption_model_processor, get_som_labeled_img
-
- # Initialize FastAPI app
- app = FastAPI(title="OmniParser API")
- app.state.limiter = Limiter(key_func=get_remote_address)
+ app = FastAPI()
+ limiter = Limiter(key_func=get_remote_address)
+ app.state.limiter = limiter
  app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

- # Load models at startup (reusing your existing code)
- yolo_model = get_yolo_model(model_path='weights/icon_detect/best.pt')
- caption_model_processor = get_caption_model_processor(
-     model_name="florence2",
-     model_name_or_path="weights/icon_caption_florence"
- )
-
- # Define request model
- class ProcessRequest(BaseModel):
-     box_threshold: float = 0.05
-     iou_threshold: float = 0.1
-     screen_width: int = 1920
-     screen_height: int = 1080
-
  @app.post("/process")
- @app.state.limiter.limit("5/minute") # Limit to 5 requests per minute per IP
- async def process_image(
-     file: UploadFile = File(...),
-     params: ProcessRequest = None
- ):
-     # Read image from request
-     image_bytes = await file.read()
-     image = Image.open(io.BytesIO(image_bytes))
-
-     # Save image temporarily (reusing your existing logic)
-     temp_path = 'imgs/temp_image.png'
-     image.save(temp_path)
-
-     # Process image using your existing functions
-     ocr_bbox_rslt, _ = check_ocr_box(
-         temp_path,
-         display_img=False,
-         output_bb_format='xyxy',
-         goal_filtering=None,
-         easyocr_args={'paragraph': False, 'text_threshold':0.9}
-     )
-
-     text, ocr_bbox = ocr_bbox_rslt
-
-     dino_labled_img, label_coordinates, parsed_content_list = get_som_labeled_img(
-         temp_path,
-         yolo_model,
-         BOX_TRESHOLD=params.box_threshold,
-         output_coord_in_ratio=True,
-         ocr_bbox=ocr_bbox,
-         draw_bbox_config={
-             'text_scale': 0.8,
-             'text_thickness': 2,
-             'text_padding': 2,
-             'thickness': 2,
-         },
-         caption_model_processor=caption_model_processor,
-         ocr_text=text,
-         iou_threshold=params.iou_threshold
-     )
-
-     # Format output (similar to your existing code)
-     output_text = []
-     for i, (element_id, coords) in enumerate(label_coordinates.items()):
-         x, y, w, h = coords
-         center_x_norm = x + (w/2)
-         center_y_norm = y + (h/2)
-         screen_x = int(center_x_norm * params.screen_width)
-         screen_y = int(center_y_norm * params.screen_height)
-         screen_w = int(w * params.screen_width)
-         screen_h = int(h * params.screen_height)
-
-         element_desc = parsed_content_list[i] if i < len(parsed_content_list) else f"Icon {i}"
-         output_text.append({
-             "description": element_desc,
-             "normalized_coordinates": {
-                 "x": center_x_norm,
-                 "y": center_y_norm
-             },
-             "screen_coordinates": {
-                 "x": screen_x,
-                 "y": screen_y
-             },
-             "dimensions": {
-                 "width": screen_w,
-                 "height": screen_h
-             }
-         })
-
-     return {
-         "processed_image": dino_labled_img, # Base64 encoded image
-         "elements": output_text
-     }
+ @limiter.limit("5/minute")
+ async def process_image(request: Request, file: UploadFile = File(...)):
+     try:
+         contents = await file.read()
+         # Your processing logic here
+         return JSONResponse(
+             status_code=200,
+             content={"message": "Success", "filename": file.filename}
+         )
+     except Exception as e:
+         return JSONResponse(
+             status_code=500,
+             content={"error": str(e)}
+         )
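Note (illustrative, not part of the commit): the rewritten api.py keeps the rate-limited /process route but leaves its body as the "# Your processing logic here" placeholder. One way that placeholder could be filled is by reusing the model setup and utility calls from the removed version of the file; the sketch below does exactly that, except that it returns the raw parsed_content_list instead of rebuilding the old per-element coordinate objects.

```python
# Illustrative sketch only: restores the removed processing logic inside the new
# rate-limited handler. Assumes utils.py and the weights/ layout are unchanged.
import io

from fastapi import FastAPI, File, Request, UploadFile
from fastapi.responses import JSONResponse
from PIL import Image
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

from utils import check_ocr_box, get_caption_model_processor, get_som_labeled_img, get_yolo_model

app = FastAPI()
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

# Load the models once at startup, as the removed version of api.py did.
yolo_model = get_yolo_model(model_path="weights/icon_detect/best.pt")
caption_model_processor = get_caption_model_processor(
    model_name="florence2", model_name_or_path="weights/icon_caption_florence"
)

@app.post("/process")
@limiter.limit("5/minute")
async def process_image(request: Request, file: UploadFile = File(...)):
    try:
        # Persist the upload so the utils functions can read it from disk.
        contents = await file.read()
        temp_path = "imgs/temp_image.png"
        Image.open(io.BytesIO(contents)).save(temp_path)

        # OCR pass, with the same arguments the removed code used.
        ocr_bbox_rslt, _ = check_ocr_box(
            temp_path,
            display_img=False,
            output_bb_format="xyxy",
            goal_filtering=None,
            easyocr_args={"paragraph": False, "text_threshold": 0.9},
        )
        text, ocr_bbox = ocr_bbox_rslt

        # Detection + captioning pass producing the annotated image and labels.
        dino_labled_img, label_coordinates, parsed_content_list = get_som_labeled_img(
            temp_path,
            yolo_model,
            BOX_TRESHOLD=0.05,
            output_coord_in_ratio=True,
            ocr_bbox=ocr_bbox,
            draw_bbox_config={
                "text_scale": 0.8,
                "text_thickness": 2,
                "text_padding": 2,
                "thickness": 2,
            },
            caption_model_processor=caption_model_processor,
            ocr_text=text,
            iou_threshold=0.1,
        )
        return JSONResponse(
            status_code=200,
            content={"processed_image": dino_labled_img, "elements": parsed_content_list},
        )
    except Exception as e:
        return JSONResponse(status_code=500, content={"error": str(e)})
```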
modal_app.py ADDED
@@ -0,0 +1,144 @@
+ import modal
+ from fastapi import FastAPI, File, UploadFile, Request
+ from fastapi.middleware.cors import CORSMiddleware
+ from fastapi.responses import JSONResponse
+ from PIL import Image
+ import io
+ import base64
+ from typing import Optional
+ import traceback
+
+ # Create app and web app
+ app = modal.App("ui-coordinates-finder")
+ web_app = FastAPI()
+
+ # Add your model initialization to the app
+ @app.function(gpu="T4")
+ def init_models():
+     from utils import get_yolo_model, get_caption_model_processor
+
+     yolo_model = get_yolo_model(model_path='weights/icon_detect/best.pt')
+     caption_model_processor = get_caption_model_processor(
+         model_name="florence2",
+         model_name_or_path="weights/icon_caption_florence"
+     )
+     return yolo_model, caption_model_processor
+
+ # Configure CORS
+ web_app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_credentials=True,
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+ @app.function(gpu="T4", timeout=300)
+ @web_app.post("/process")
+ async def process_image_endpoint(
+     request: Request,
+     file: UploadFile = File(...),
+     box_threshold: float = 0.05,
+     iou_threshold: float = 0.1,
+     screen_width: int = 1920,
+     screen_height: int = 1080
+ ):
+     try:
+         # Add logging for debugging
+         print(f"Processing file: {file.filename}")
+
+         # Read and process the image
+         contents = await file.read()
+         print("File read successfully")
+
+         # Save image temporarily
+         image_save_path = '/tmp/saved_image_demo.png'
+         image = Image.open(io.BytesIO(contents))
+         image.save(image_save_path)
+
+         # Initialize models
+         yolo_model, caption_model_processor = init_models()
+
+         # Process with OCR and detection
+         from utils import check_ocr_box, get_som_labeled_img
+
+         draw_bbox_config = {
+             'text_scale': 0.8,
+             'text_thickness': 2,
+             'text_padding': 2,
+             'thickness': 2,
+         }
+
+         ocr_bbox_rslt, _ = check_ocr_box(
+             image_save_path,
+             display_img=False,
+             output_bb_format='xyxy',
+             goal_filtering=None,
+             easyocr_args={'paragraph': False, 'text_threshold': 0.9}
+         )
+         text, ocr_bbox = ocr_bbox_rslt
+
+         dino_labled_img, label_coordinates, parsed_content_list = get_som_labeled_img(
+             image_save_path,
+             yolo_model,
+             BOX_TRESHOLD=box_threshold,
+             output_coord_in_ratio=True,
+             ocr_bbox=ocr_bbox,
+             draw_bbox_config=draw_bbox_config,
+             caption_model_processor=caption_model_processor,
+             ocr_text=text,
+             iou_threshold=iou_threshold
+         )
+
+         # Format the output similar to Gradio demo
+         output_text = []
+         for i, (element_id, coords) in enumerate(label_coordinates.items()):
+             x, y, w, h = coords
+
+             # Calculate center points (normalized)
+             center_x_norm = x + (w/2)
+             center_y_norm = y + (h/2)
+
+             # Calculate screen coordinates
+             screen_x = int(center_x_norm * screen_width)
+             screen_y = int(center_y_norm * screen_height)
+             screen_w = int(w * screen_width)
+             screen_h = int(h * screen_height)
+
+             if i < len(parsed_content_list):
+                 element_desc = parsed_content_list[i]
+                 output_text.append({
+                     "description": element_desc,
+                     "normalized_coords": (center_x_norm, center_y_norm),
+                     "screen_coords": (screen_x, screen_y),
+                     "dimensions": (screen_w, screen_h)
+                 })
+
+         return JSONResponse(
+             status_code=200,
+             content={
+                 "message": "Success",
+                 "filename": file.filename,
+                 "processed_image": dino_labled_img,  # Base64 encoded image
+                 "elements": output_text
+             }
+         )
+
+     except Exception as e:
+         error_details = traceback.format_exc()
+         print(f"Error processing request: {error_details}")
+         return JSONResponse(
+             status_code=500,
+             content={
+                 "error": str(e),
+                 "details": error_details
+             }
+         )
+
+ @app.function()
+ @modal.asgi_app()
+ def fastapi_app():
+     return web_app
+
+ if __name__ == "__main__":
+     app.serve()
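For context (illustrative, not from the repo): the `@modal.asgi_app()` wrapper is what exposes web_app as a hosted endpoint, so once the app is published with the Modal CLI (e.g. `modal deploy modal_app.py`) the /process route becomes reachable at a *.modal.run URL of the kind the test scripts below target.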
requirements.txt CHANGED
@@ -1,16 +1,29 @@
- torch
- easyocr
- torchvision
- supervision==0.18.0
- openai==1.3.5
- transformers
- ultralytics==8.1.24
- azure-identity
- numpy
- opencv-python
- opencv-python-headless
- gradio
- dill
- accelerate
- timm
- einops==0.8.0
+ # Use Python 3.12 as base image
+ FROM python:3.12-slim
+
+ # Install system dependencies required for OpenCV and other packages
+ RUN apt-get update && apt-get install -y \
+     libgl1-mesa-glx \
+     libglib2.0-0 \
+     build-essential \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Set working directory
+ WORKDIR /app
+
+ # Copy requirements and app files
+ COPY requirements.txt .
+ COPY . .
+
+ # Install Python dependencies
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Set environment variables
+ ENV GRADIO_SERVER_NAME="0.0.0.0"
+ ENV GRADIO_SERVER_PORT=7860
+
+ # Expose the port Gradio will run on
+ EXPOSE 7860
+
+ # Command to run the application
+ CMD ["python", "gradio_demo.py"]
test-api.py ADDED
@@ -0,0 +1,30 @@
+ import requests
+ import json
+
+ def test_api():
+     url = "https://zorba11--ui-coordinates-finder-fastapi-app.modal.run/process"
+
+     headers = {
+         'Accept': 'application/json',
+     }
+
+     try:
+         files = {
+             'file': ('screen-1.png', open('/Users/zorba11/Desktop/screen-1.png', 'rb'), 'image/png')
+         }
+
+         response = requests.post(
+             url,
+             files=files,
+             headers=headers
+         )
+
+         print(f"Status Code: {response.status_code}")
+         print(f"Response Headers: {dict(response.headers)}")
+         print(f"Response Content: {response.content.decode()}")
+
+     except Exception as e:
+         print(f"Error: {str(e)}")
+
+ if __name__ == "__main__":
+     test_api()
test_api.py ADDED
@@ -0,0 +1,43 @@
+ import requests
+ from PIL import Image
+ import base64
+ import io
+
+ def test_api():
+     url = "https://zorba11--ui-coordinates-finder-fastapi-app.modal.run/process"
+
+     # Parameters matching your Gradio demo
+     params = {
+         'box_threshold': 0.05,
+         'iou_threshold': 0.1,
+         'screen_width': 1920,
+         'screen_height': 1080
+     }
+
+     files = {
+         'file': ('screen-1.png', open('/Users/zorba11/Desktop/screen-1.png', 'rb'), 'image/png')
+     }
+
+     response = requests.post(url, files=files, params=params)
+
+     if response.status_code == 200:
+         result = response.json()
+
+         # Convert base64 image back to PIL Image
+         img_data = base64.b64decode(result['processed_image'])
+         processed_image = Image.open(io.BytesIO(img_data))
+
+         # Save the processed image
+         processed_image.save('processed_output.png')
+
+         # Print the detected elements
+         for element in result['elements']:
+             print("\nElement:", element['description'])
+             print("Normalized coordinates:", element['normalized_coords'])
+             print("Screen coordinates:", element['screen_coords'])
+             print("Dimensions:", element['dimensions'])
+     else:
+         print("Error:", response.text)
+
+ if __name__ == "__main__":
+     test_api()