rmoxon committed on
Commit 87fa678 · verified · 1 Parent(s): 1b5b123

Upload 5 files

Files changed (4)
  1. Dockerfile +32 -32
  2. README.md +97 -97
  3. main.py +140 -1
  4. requirements.txt +11 -9
Dockerfile CHANGED
@@ -1,33 +1,33 @@ (file re-uploaded; content unchanged)

FROM python:3.11-slim

WORKDIR /code

# Install system dependencies
RUN apt-get update && apt-get install -y \
    wget \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Create cache directories with proper permissions
RUN mkdir -p /code/cache && \
    mkdir -p /tmp/cache && \
    chmod 777 /code/cache && \
    chmod 777 /tmp/cache

# Set environment variables for cache directories
ENV TRANSFORMERS_CACHE=/code/cache
ENV HF_HOME=/code/cache
ENV TORCH_HOME=/code/cache

# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Expose port 7860 (Hugging Face Spaces default)
EXPOSE 7860

# Run the application
CMD ["python", "app.py"]
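
The cache setup above matters because Hugging Face Spaces run the container as a non-root user: `TRANSFORMERS_CACHE`, `HF_HOME`, and `TORCH_HOME` all point at the world-writable `/code/cache`, so the model weights downloaded at startup do not fall back to an unwritable `~/.cache`. A minimal sanity check, sketched here as a hypothetical `check_cache.py` that is not part of this commit:

```python
# check_cache.py -- hypothetical helper, not included in the repository.
# Prints the cache locations configured in the Dockerfile and whether the
# running process can actually write to them.
import os


def check_cache_dirs() -> None:
    for var in ("TRANSFORMERS_CACHE", "HF_HOME", "TORCH_HOME"):
        path = os.environ.get(var)
        if not path:
            print(f"{var} is not set")
            continue
        writable = os.path.isdir(path) and os.access(path, os.W_OK)
        print(f"{var}={path} (writable: {writable})")


if __name__ == "__main__":
    check_cache_dirs()
```

Run inside the container, all three variables should resolve to `/code/cache` and report `writable: True`.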
README.md CHANGED
@@ -1,98 +1,98 @@ (file re-uploaded; content unchanged)

---
title: CLIP Service
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---

# CLIP Service 🔍

A FastAPI service that provides CLIP (Contrastive Language-Image Pre-training) embeddings for images and text using the `openai/clip-vit-large-patch14` model.

## 🚀 Features

- **Image Encoding**: Generate 768-dimensional embeddings from image URLs
- **Text Encoding**: Generate embeddings from text descriptions
- **High Performance**: Optimized for batch processing
- **REST API**: Simple HTTP endpoints for easy integration

## 📋 API Endpoints

### `POST /encode/image`
Generate embeddings for an image fetched from a URL.

**Request:**
```json
{
  "image_url": "https://example.com/image.jpg"
}
```

**Response:**
```json
{
  "embedding": [0.1, -0.2, 0.3, ...], // 768 dimensions
  "dimensions": 768
}
```

### `POST /encode/text`
Generate embeddings for text.

**Request:**
```json
{
  "text": "a beautiful sunset over mountains"
}
```

**Response:**
```json
{
  "embedding": [0.1, -0.2, 0.3, ...], // 768 dimensions
  "dimensions": 768
}
```

### `GET /health`
Check service health and status.

## 🔧 Usage Examples

```bash
# Encode an image
curl -X POST "https://your-username-clip-service.hf.space/encode/image" \
  -H "Content-Type: application/json" \
  -d '{"image_url": "https://example.com/image.jpg"}'

# Encode text
curl -X POST "https://your-username-clip-service.hf.space/encode/text" \
  -H "Content-Type: application/json" \
  -d '{"text": "a beautiful landscape"}'
```

## 🏗️ Integration

This service is designed to work with Pinterest-like applications for:
- Visual similarity search
- Content-based recommendations
- Cross-modal search (text to image, image to text)

## 📝 Model Information

- **Model**: `openai/clip-vit-large-patch14`
- **Embedding Dimensions**: 768
- **Supported Images**: JPG, PNG, GIF, WebP
- **Max Image Size**: Recommended < 10MB

## ⚡ Performance

- **CPU**: ~2-5 seconds per image
- **GPU**: ~0.5-1 second per image (when available)
- **Batch Processing**: Supported for multiple requests

---

Built with ❤️ using [Transformers](https://huggingface.co/transformers) and [FastAPI](https://fastapi.tiangolo.com/)
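
For callers who prefer Python over curl, a small client along the lines below covers the two documented endpoints plus the cosine-similarity comparison implied by the Integration section. It is a sketch only: the Space URL is the same placeholder used in the curl examples, and the helper names are invented for illustration.

```python
# clip_client_example.py -- illustrative sketch; BASE_URL is a placeholder.
import numpy as np
import requests

BASE_URL = "https://your-username-clip-service.hf.space"


def encode_image(image_url: str) -> np.ndarray:
    resp = requests.post(f"{BASE_URL}/encode/image",
                         json={"image_url": image_url}, timeout=60)
    resp.raise_for_status()
    return np.asarray(resp.json()["embedding"])  # 768-dimensional vector


def encode_text(text: str) -> np.ndarray:
    resp = requests.post(f"{BASE_URL}/encode/text",
                         json={"text": text}, timeout=60)
    resp.raise_for_status()
    return np.asarray(resp.json()["embedding"])


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Normalise explicitly rather than assuming the service already did.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    image_vec = encode_image("https://example.com/image.jpg")
    text_vec = encode_text("a beautiful sunset over mountains")
    print("text-image similarity:", cosine_similarity(image_vec, text_vec))
```

Ranking a set of images by `cosine_similarity` against a single text query is the basic building block for the visual-similarity and cross-modal search use cases listed above.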
main.py CHANGED
@@ -1,12 +1,14 @@
  from fastapi import FastAPI, HTTPException
  from pydantic import BaseModel
- from transformers import CLIPProcessor, CLIPModel
+ from transformers import CLIPProcessor, CLIPModel, ClapModel, ClapProcessor
  import torch
  from PIL import Image
  import requests
  import numpy as np
  import io
  import logging
+ import librosa
+ import soundfile as sf

  # Configure logging
  logging.basicConfig(level=logging.INFO)
@@ -20,6 +22,11 @@ class CLIPService:
          self.model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
          self.processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
          logger.info("CLIP model loaded successfully")
+
+         logger.info("Loading CLAP model for audio...")
+         self.clap_model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
+         self.clap_processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")
+         logger.info("CLAP model loaded successfully")

      def encode_image(self, image_url: str) -> list:
          try:
@@ -84,6 +91,123 @@ class CLIPService:
          except Exception as e:
              logger.error(f"Error encoding text '{text}': {str(e)}")
              raise HTTPException(status_code=500, detail=f"Failed to encode text: {str(e)}")
+
+     def encode_audio(self, audio_url: str) -> list:
+         try:
+             # Enhanced headers for audio files with MIME whitelist
+             headers = {
+                 'User-Agent': 'CLAP-Service/1.0 (Audio-Embedding-Service)',
+                 'Accept': 'audio/mpeg, audio/wav, audio/mp4, audio/ogg, audio/flac',
+                 'Cache-Control': 'no-cache'
+             }
+
+             logger.info(f"Fetching audio from URL: {audio_url}")
+
+             # Increase timeout for large files, but add streaming response
+             response = requests.get(audio_url, timeout=60, headers=headers, stream=True)
+             response.raise_for_status()
+
+             # Check content type before processing
+             content_type = response.headers.get('content-type', 'unknown')
+             if not content_type.startswith('audio/'):
+                 raise ValueError(f"Invalid content type: {content_type}. Expected audio/*")
+
+             # Check file size before downloading (100MB limit)
+             content_length = response.headers.get('content-length')
+             if content_length and int(content_length) > 100 * 1024 * 1024:
+                 raise ValueError(f"Audio file too large: {content_length} bytes. Maximum is 100MB")
+
+             # Stream content to BytesIO with size limit
+             audio_data = io.BytesIO()
+             total_size = 0
+             max_size = 100 * 1024 * 1024  # 100MB
+
+             for chunk in response.iter_content(chunk_size=8192):
+                 total_size += len(chunk)
+                 if total_size > max_size:
+                     raise ValueError("Audio file too large during download")
+                 audio_data.write(chunk)
+
+             audio_data.seek(0)
+             logger.info(f"Successfully fetched audio: {content_type}, {total_size} bytes")
+
+             # Load audio with duration limit (10 minutes = 600 seconds)
+             MAX_DURATION = 600  # 10 minutes
+
+             try:
+                 # First, get duration without loading full audio
+                 duration = librosa.get_duration(path=audio_data)
+                 audio_data.seek(0)  # Reset stream
+
+                 if duration > MAX_DURATION:
+                     raise ValueError(f"Audio duration ({duration:.1f}s) exceeds maximum allowed ({MAX_DURATION}s)")
+
+                 logger.info(f"Audio duration: {duration:.1f} seconds")
+
+                 # Load only first 30 seconds for embedding (CLAP works well with shorter clips)
+                 # This reduces memory usage significantly
+                 duration_limit = min(30.0, duration)
+
+                 # Load audio with librosa (48kHz is CLAP's expected sample rate)
+                 waveform, sample_rate = librosa.load(
+                     audio_data,
+                     sr=48000,
+                     mono=True,
+                     duration=duration_limit,
+                     offset=0.0
+                 )
+
+                 logger.info(f"Processing audio: {len(waveform)} samples at {sample_rate}Hz ({duration_limit:.1f}s)")
+
+             except Exception as e:
+                 logger.error(f"Error loading audio file: {str(e)}")
+                 raise ValueError(f"Failed to load audio file: {str(e)}")
+
+             # Process audio through CLAP
+             inputs = self.clap_processor(audios=waveform, return_tensors="pt", sampling_rate=48000)
+
+             with torch.no_grad():
+                 audio_features = self.clap_model.get_audio_features(**inputs)
+                 # Normalize the features
+                 audio_features = audio_features / audio_features.norm(dim=-1, keepdim=True)
+
+             embedding = audio_features.numpy().flatten().tolist()
+             logger.info(f"Generated audio embedding with {len(embedding)} dimensions")
+
+             return embedding
+
+         except ValueError as e:
+             # Handle validation errors (file too large, wrong format, etc.)
+             logger.error(f"Validation error for audio {audio_url}: {str(e)}")
+             raise HTTPException(status_code=400, detail=str(e))
+         except requests.exceptions.RequestException as e:
+             logger.error(f"Network error fetching audio {audio_url}: {str(e)}")
+             if hasattr(e, 'response') and e.response is not None:
+                 status_code = e.response.status_code
+                 if status_code == 403:
+                     raise HTTPException(status_code=403, detail="Access denied to audio URL")
+                 elif status_code == 404:
+                     raise HTTPException(status_code=404, detail="Audio not found at URL")
+                 elif status_code >= 500:
+                     raise HTTPException(status_code=502, detail="Audio service temporarily unavailable")
+             raise HTTPException(status_code=500, detail=f"Failed to fetch audio: {str(e)}")
+         except Exception as e:
+             logger.error(f"Error encoding audio {audio_url}: {str(e)}")
+             raise HTTPException(status_code=500, detail=f"Failed to encode audio: {str(e)}")
+
+     def encode_text_for_audio(self, text: str) -> list:
+         """Encode text for cross-modal audio search"""
+         try:
+             inputs = self.clap_processor(text=[text], return_tensors="pt", padding=True)
+
+             with torch.no_grad():
+                 text_features = self.clap_model.get_text_features(**inputs)
+                 text_features = text_features / text_features.norm(dim=-1, keepdim=True)
+
+             return text_features.numpy().flatten().tolist()
+         except Exception as e:
+             logger.error(f"Error encoding text for audio '{text}': {str(e)}")
+             raise HTTPException(status_code=500, detail=f"Failed to encode text for audio: {str(e)}")

  # Initialize service
  clip_service = CLIPService()
@@ -94,6 +218,9 @@ class ImageRequest(BaseModel):
  class TextRequest(BaseModel):
      text: str

+ class AudioRequest(BaseModel):
+     audio_url: str
+
  @app.post("/encode/image")
  async def encode_image(request: ImageRequest):
      embedding = clip_service.encode_image(request.image_url)
@@ -104,6 +231,18 @@ async def encode_text(request: TextRequest):
      embedding = clip_service.encode_text(request.text)
      return {"embedding": embedding}

+ @app.post("/encode/audio")
+ async def encode_audio(request: AudioRequest):
+     """Encode audio file to CLAP embedding vector"""
+     embedding = clip_service.encode_audio(request.audio_url)
+     return {"embedding": embedding}
+
+ @app.post("/encode/text-audio")
+ async def encode_text_for_audio(request: TextRequest):
+     """Encode text for audio similarity search"""
+     embedding = clip_service.encode_text_for_audio(request.text)
+     return {"embedding": embedding}
+
  @app.get("/health")
  async def health_check():
      return {"status": "healthy", "model": "clip-vit-large-patch14"}
requirements.txt CHANGED
@@ -1,9 +1,11 @@
- torch==2.0.1
- transformers==4.30.0
- Pillow==9.5.0
- requests==2.31.0
- fastapi==0.104.1
- uvicorn==0.22.0
- python-multipart==0.0.6
- pydantic==2.5.0
- numpy<2.0.0
+ torch==2.0.1
+ transformers==4.30.0
+ Pillow==9.5.0
+ requests==2.31.0
+ fastapi==0.104.1
+ uvicorn==0.22.0
+ python-multipart==0.0.6
+ pydantic==2.5.0
+ numpy<2.0.0
+ librosa>=0.10.0
+ soundfile>=0.12.1