Rudra Rahul Chothe committed on
Commit
6f0417c
·
verified ·
1 Parent(s): 87ffa09

Upload folder using huggingface_hub

Browse files
Files changed (5)
  1. README.md +117 -122
  2. approach.txt +166 -0
  3. requirements.txt +7 -10
  4. src/preprocessing.py +62 -44
  5. src/similarity_search.py +21 -23
README.md CHANGED
@@ -1,122 +1,117 @@
- ---
- language: en
- license: mit
- tags:
- - image-search
- - machine-learning
- title: Image Search Engine Fashion
- sdk: streamlit
- emoji: 💻
- colorFrom: blue
- colorTo: pink
- ---
-
- ## Image Similarity Search Engine
- A deep learning-based image similarity search engine that uses EfficientNetB0 for feature extraction and FAISS for fast similarity search. The application provides a web interface built with Streamlit for easy interaction.
-
- Features
- - Deep Feature Extraction: Uses EfficientNetB0 (pre-trained on ImageNet) to extract meaningful features from images
- - Fast Similarity Search: Implements FAISS for efficient nearest-neighbor search
- - Interactive Web Interface: User-friendly interface built with Streamlit
- - Real-time Processing: Shows progress and time estimates during feature extraction
- - Scalable Architecture: Designed to handle large image datasets efficiently
-
- ## Installation
- ## Prerequisites
-
- Python 3.8 or higher
- pip package manager
-
- ## Setup
-
- 1. Clone the repository:
- ```
- git clone https://github.com/yourusername/image-similarity-search.git
- cd image-similarity-search
- ```
- 2. Create and activate a virtual environment:
- ```
- python -m venv venv
- source venv/bin/activate # On Windows use: venv\Scripts\activate
- ```
- 3. Install required packages:
- ```
- pip install -r requirements.txt
- ```
-
- ## Project Structure
- ```
- image-similarity-search/
- ├── data/
- │   ├── images/              # Directory for train dataset images
- │   ├── sample-test-images/  # Directory for test dataset images
- │   └── embeddings.pkl       # Pre-computed image embeddings
- ├── src/
- │   ├── feature_extractor.py # EfficientNetB0 feature extraction
- │   ├── preprocessing.py     # Image preprocessing and embedding computation
- │   ├── similarity_search.py # FAISS-based similarity search
- │   └── main.py              # Streamlit web interface
- ├── requirements.txt
- ├── README.md
- └── .gitignore
- ```
- ## Usage
-
- 1. **Prepare Your Dataset:**
- Get training image dataset from drive:
- ```
- https://drive.google.com/file/d/1U2PljA7NE57jcSSzPs21ZurdIPXdYZtN/view?usp=drive_link
- ```
- Place your image dataset in the data/images directory
- Supported formats: JPG, JPEG, PNG
-
- 2. **Generate Embeddings:**
- ```
- python -m src.preprocessing
- ```
-
- **This will**:
- - Process all images in the dataset
- - Show progress and time estimates
- - Save embeddings to data/embeddings.pkl
-
- 3. **Run the Web Interface:**
- ```
- streamlit run src/main.py
- ```
-
- 4. Using the Interface:
-
- - Upload a query image using the file uploader
- - Click "Search Similar Images"
- - View the most similar images from your dataset
-
- ## Technical Details
- **Feature Extraction**
- - Uses EfficientNetB0 without top layers
- - Input image size: 224x224 pixels
- - Output feature dimension: 1280
-
- **Similarity Search**
- - Uses FAISS IndexFlatL2 for L2 distance-based search
- - Returns top-k most similar images (default k=5)
-
- **Web Interface**
- - Responsive design with Streamlit
- - Displays original and similar images with similarity scores
- - Progress tracking during processing
-
- **Dependencies**
- - TensorFlow 2.x
- - FAISS-cpu (or FAISS-gpu for GPU support)
- - Streamlit
- - Pillow
- - NumPy
- - tqdm
-
- **Performance**
- - Feature extraction: ~1 second per image on CPU
- - Similarity search: Near real-time for datasets up to 100k images
- - Memory usage depends on dataset size (approximately 5KB per image embedding)

+ ---
+ language: en
+ license: mit
+ tags:
+ - image-search
+ - machine-learning
+ ---
+
+ ## Image Similarity Search Engine
+ A deep learning-based image similarity search engine that uses EfficientNetB0 for feature extraction and FAISS for fast similarity search. The application provides a web interface built with Streamlit for easy interaction.
+
+ ## Features
+ - Deep Feature Extraction: Uses EfficientNetB0 (pre-trained on ImageNet) to extract meaningful features from images
+ - Fast Similarity Search: Implements FAISS for efficient nearest-neighbor search
+ - Interactive Web Interface: User-friendly interface built with Streamlit
+ - Real-time Processing: Shows progress and time estimates during feature extraction
+ - Scalable Architecture: Designed to handle large image datasets efficiently
+
+ ## Installation
+ ### Prerequisites
+
+ - Python 3.8 or higher
+ - pip package manager
+
+ ### Setup
+
+ 1. Clone the repository:
+ ```
+ git clone https://github.com/yourusername/image-similarity-search.git
+ cd image-similarity-search
+ ```
+ 2. Create and activate a virtual environment:
+ ```
+ python -m venv venv
+ source venv/bin/activate # On Windows use: venv\Scripts\activate
+ ```
+ 3. Install the required packages:
+ ```
+ pip install -r requirements.txt
+ ```
+
+ ## Project Structure
+ ```
+ image-similarity-search/
+ ├── data/
+ │   ├── images/              # Directory for train dataset images
+ │   ├── sample-test-images/  # Directory for test dataset images
+ │   └── embeddings.pkl       # Pre-computed image embeddings
+ ├── src/
+ │   ├── feature_extractor.py # EfficientNetB0 feature extraction
+ │   ├── preprocessing.py     # Image preprocessing and embedding computation
+ │   ├── similarity_search.py # FAISS-based similarity search
+ │   └── main.py              # Streamlit web interface
+ ├── requirements.txt
+ ├── README.md
+ └── .gitignore
+ ```
+ ## Usage
+
+ 1. **Prepare Your Dataset:**
+ Download the training image dataset from Google Drive:
+ ```
+ https://drive.google.com/file/d/1U2PljA7NE57jcSSzPs21ZurdIPXdYZtN/view?usp=drive_link
+ ```
+ Place the images in the data/images directory.
+ Supported formats: JPG, JPEG, PNG
+
+ 2. **Generate Embeddings:**
+ ```
+ python -m src.preprocessing
+ ```
+
+ **This will:**
+ - Process all images in the dataset
+ - Show progress and time estimates
+ - Save embeddings to data/embeddings.pkl
+
+ 3. **Run the Web Interface:**
+ ```
+ streamlit run src/main.py
+ ```
+
+ 4. **Use the Interface:**
+
+ - Upload a query image using the file uploader
+ - Click "Search Similar Images"
+ - View the most similar images from your dataset
+
+ ## Technical Details
+ **Feature Extraction**
+ - Uses EfficientNetB0 without top layers
+ - Input image size: 224x224 pixels
+ - Output feature dimension: 1280
+
+ **Similarity Search**
+ - Uses FAISS IndexFlatL2 for L2 distance-based search
+ - Returns the top-k most similar images (default k=5)
+
+ **Web Interface**
+ - Responsive design with Streamlit
+ - Displays the query image and similar images with similarity scores
+ - Progress tracking during processing
+
+ **Dependencies**
+ - TensorFlow 2.x
+ - faiss-cpu (or faiss-gpu for GPU support)
+ - Streamlit
+ - Pillow
+ - NumPy
+ - tqdm
+
+ **Performance**
+ - Feature extraction: ~1 second per image on CPU
+ - Similarity search: near real-time for datasets up to 100k images
+ - Memory usage depends on dataset size (approximately 5 KB per image embedding)

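A minimal sketch of the IndexFlatL2 flow described under Technical Details, assuming 1280-dimensional float32 embeddings; the names and sizes here are illustrative, not taken from the repository:

```
import faiss
import numpy as np

# Hypothetical database of 1,000 embeddings, each 1280-dim
# (the size of EfficientNetB0's pooled output).
db_embeddings = np.random.rand(1000, 1280).astype('float32')

index = faiss.IndexFlatL2(1280)  # exact L2 search; no training step needed
index.add(db_embeddings)         # index all database vectors

query = np.random.rand(1, 1280).astype('float32')
distances, indices = index.search(query, 5)  # top-5 nearest neighbors
print(indices[0])    # row positions into db_embeddings
print(distances[0])  # squared L2 distances, smallest first
```

The ~5 KB-per-embedding figure under Performance is consistent with this layout: 1280 float32 values × 4 bytes = 5120 bytes per image.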
approach.txt ADDED
@@ -0,0 +1,166 @@
+ feature_extractor.py -
+ import tensorflow as tf
+ from tensorflow.keras.applications import EfficientNetB0
+ from tensorflow.keras.preprocessing import image
+ from tensorflow.keras.applications.efficientnet import preprocess_input
+ import numpy as np
+
+ class FeatureExtractor:
+     def __init__(self):
+         # Load pretrained EfficientNetB0 model without top layers
+         base_model = EfficientNetB0(weights='imagenet', include_top=False, pooling='avg')
+         self.model = tf.keras.Model(inputs=base_model.input, outputs=base_model.output)
+
+     def extract_features(self, img_path):
+         # Load and preprocess the image
+         img = image.load_img(img_path, target_size=(224, 224))
+         img_array = image.img_to_array(img)
+         expanded_img = np.expand_dims(img_array, axis=0)
+         preprocessed_img = preprocess_input(expanded_img)
+
+         # Extract features
+         features = self.model.predict(preprocessed_img)
+         return features.flatten()
+
+ preprocessing.py -
+ import os
+ import pickle
+ from .feature_extractor import FeatureExtractor
+ import time
+ from tqdm import tqdm
+
+ def precompute_embeddings(image_dir='data/images', output_path='data/embeddings.pkl'):
+     # Initialize the feature extractor
+     extractor = FeatureExtractor()
+
+     embeddings = []
+     image_paths = []
+
+     # Get total number of valid images
+     valid_images = [f for f in os.listdir(image_dir)
+                     if f.lower().endswith(('.png', '.jpg', '.jpeg'))]
+     total_images = len(valid_images)
+
+     print(f"\nFound {total_images} images to process")
+
+     # Estimate time (assuming ~1 second per image for EfficientNetB0)
+     estimated_time = total_images * 1  # 1 second per image
+     print(f"Estimated time: {estimated_time//60} minutes and {estimated_time%60} seconds\n")
+
+     # Use tqdm for progress bar; valid_images is already filtered to
+     # supported extensions, so no per-file extension check is needed here
+     start_time = time.time()
+     for idx, filename in enumerate(tqdm(valid_images, desc="Processing images")):
+         img_path = os.path.join(image_dir, filename)
+         try:
+             # Show current image being processed
+             print(f"\rProcessing image {idx+1}/{total_images}: {filename}", end="")
+
+             embedding = extractor.extract_features(img_path)
+             embeddings.append(embedding)
+             image_paths.append(img_path)
+
+             # Calculate and show remaining time
+             elapsed_time = time.time() - start_time
+             avg_time_per_image = elapsed_time / (idx + 1)
+             remaining_images = total_images - (idx + 1)
+             estimated_remaining_time = remaining_images * avg_time_per_image
+
+             print(f" | Remaining time: {estimated_remaining_time//60:.0f}m {estimated_remaining_time%60:.0f}s")
+
+         except Exception as e:
+             print(f"\nError processing {filename}: {e}")
+
+     # Save embeddings and paths
+     with open(output_path, 'wb') as f:
+         pickle.dump({'embeddings': embeddings, 'image_paths': image_paths}, f)
+
+     total_time = time.time() - start_time
+     print("\nProcessing complete!")
+     print(f"Total time taken: {total_time//60:.0f} minutes and {total_time%60:.0f} seconds")
+     print(f"Successfully processed {len(embeddings)}/{total_images} images")
+     print(f"Embeddings saved to {output_path}")
+
+     return embeddings, image_paths
+
+ if __name__ == "__main__":
+     precompute_embeddings()
+
+ similarity_search.py -
+ import faiss
+ import numpy as np
+ import pickle
+ import os
+
+ class SimilaritySearchEngine:
+     def __init__(self, embeddings_path='data/embeddings.pkl'):
+         # Load precomputed embeddings
+         with open(embeddings_path, 'rb') as f:
+             data = pickle.load(f)
+         self.embeddings = data['embeddings']
+         self.image_paths = data['image_paths']
+
+         # Create FAISS index
+         dimension = len(self.embeddings[0])
+         self.index = faiss.IndexFlatL2(dimension)
+         self.index.add(np.array(self.embeddings))
+
+     def search_similar_images(self, query_embedding, top_k=5):
+         # Perform similarity search
+         distances, indices = self.index.search(np.array([query_embedding]), top_k)
+         return [self.image_paths[idx] for idx in indices[0]], distances[0]
+
+ app.py -
+ import streamlit as st
+ from PIL import Image
+ from src.feature_extractor import FeatureExtractor
+ from src.similarity_search import SimilaritySearchEngine
+
+ def main():
+     st.title('Image Similarity Search')
+
+     # Upload query image
+     uploaded_file = st.file_uploader("Choose an image...", type=["jpg", "png", "jpeg"])
+
+     if uploaded_file is not None:
+         # Load the uploaded image
+         query_img = Image.open(uploaded_file)
+
+         # Resize and display the query image
+         query_img_resized = query_img.resize((263, 385))
+         st.image(query_img_resized, caption='Uploaded Image', use_container_width=False)
+
+         # Feature extraction and similarity search
+         if st.button("Search Similar Images"):
+             with st.spinner("Analyzing query image..."):
+                 try:
+                     # Initialize feature extractor and search engine
+                     extractor = FeatureExtractor()
+                     search_engine = SimilaritySearchEngine()
+
+                     # Save the uploaded image temporarily (convert to RGB so
+                     # PNG uploads with an alpha channel can be saved as JPEG)
+                     query_img_path = 'temp_query_image.jpg'
+                     query_img.convert('RGB').save(query_img_path)
+
+                     # Extract features from the query image
+                     query_embedding = extractor.extract_features(query_img_path)
+
+                     # Perform similarity search
+                     similar_images, distances = search_engine.search_similar_images(query_embedding)
+
+                     # Display similar images
+                     st.subheader('Similar Images')
+                     cols = st.columns(len(similar_images))
+                     for i, (img_path, dist) in enumerate(zip(similar_images, distances)):
+                         with cols[i]:
+                             similar_img = Image.open(img_path).resize((375, 550))
+                             st.image(similar_img, caption=f'Distance: {dist:.2f}', use_container_width=True)
+
+                 except Exception as e:
+                     st.error(f"Error during similarity search: {e}")
+
+ if __name__ == '__main__':
+     main()
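One detail worth flagging in app.py: `search_similar_images` returns raw L2 distances (smaller means more similar), while the README's Web Interface section speaks of similarity scores. If a bounded, higher-is-better score is wanted for the captions, a simple conversion could be used; this helper is a suggestion, not part of the committed code:

```
# Hypothetical helper (not in the commit): map an L2 distance to a
# score in (0, 1], where 1.0 means an identical embedding.
def distance_to_score(dist: float) -> float:
    return 1.0 / (1.0 + float(dist))

# e.g. st.image(similar_img, caption=f'Score: {distance_to_score(dist):.2f}')
```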
requirements.txt CHANGED
@@ -1,10 +1,7 @@
- tensorflow
- numpy
- opencv-python
- scikit-learn
- streamlit
- Pillow
- faiss-cpu
- python-dotenv
- matplotlib
- pandas

+ tensorflow
+ numpy
+ opencv-python
+ scikit-learn
+ streamlit
+ Pillow
+ faiss-cpu

src/preprocessing.py CHANGED
@@ -1,44 +1,62 @@
- import os
- import pickle
- from .feature_extractor import FeatureExtractor
- import time
- from tqdm import tqdm
-
- def precompute_embeddings():
-     # Use absolute paths for Hugging Face Spaces
-     base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
-     image_dir = os.path.join(base_dir, 'data', 'images')
-     output_path = os.path.join(base_dir, 'data', 'embeddings.pkl')
-
-     # Create directories if they don't exist
-     os.makedirs(image_dir, exist_ok=True)
-     os.makedirs(os.path.dirname(output_path), exist_ok=True)
-
-     # Rest of your existing code...
-     extractor = FeatureExtractor()
-     embeddings = []
-     image_paths = []
-
-     valid_images = [f for f in os.listdir(image_dir)
-                     if f.lower().endswith(('.png', '.jpg', '.jpeg'))]
-     total_images = len(valid_images)
-
-     print(f"\nFound {total_images} images to process")
-
-     start_time = time.time()
-     for idx, filename in enumerate(tqdm(valid_images, desc="Processing images")):
-         img_path = os.path.join(image_dir, filename)
-         try:
-             embedding = extractor.extract_features(img_path)
-             embeddings.append(embedding)
-             image_paths.append(img_path)
-         except Exception as e:
-             print(f"\nError processing {filename}: {e}")
-
-     with open(output_path, 'wb') as f:
-         pickle.dump({'embeddings': embeddings, 'image_paths': image_paths}, f)
-
-     print(f"\nProcessing complete!")
-     print(f"Successfully processed {len(embeddings)}/{total_images} images")
-
-     return embeddings, image_paths

+ import os
+ import pickle
+ from .feature_extractor import FeatureExtractor
+ import time
+ from tqdm import tqdm
+
+ def precompute_embeddings(image_dir='data/images', output_path='data/embeddings.pkl'):
+     # Initialize the feature extractor
+     extractor = FeatureExtractor()
+
+     embeddings = []
+     image_paths = []
+
+     # Get total number of valid images
+     valid_images = [f for f in os.listdir(image_dir)
+                     if f.lower().endswith(('.png', '.jpg', '.jpeg'))]
+     total_images = len(valid_images)
+
+     print(f"\nFound {total_images} images to process")
+
+     # Estimate time (assuming ~1 second per image for EfficientNetB0)
+     estimated_time = total_images * 1  # 1 second per image
+     print(f"Estimated time: {estimated_time//60} minutes and {estimated_time%60} seconds\n")
+
+     # Use tqdm for progress bar; valid_images is already filtered to
+     # supported extensions, so no per-file extension check is needed here
+     start_time = time.time()
+     for idx, filename in enumerate(tqdm(valid_images, desc="Processing images")):
+         img_path = os.path.join(image_dir, filename)
+         try:
+             # Show current image being processed
+             print(f"\rProcessing image {idx+1}/{total_images}: {filename}", end="")
+
+             embedding = extractor.extract_features(img_path)
+             embeddings.append(embedding)
+             image_paths.append(img_path)
+
+             # Calculate and show remaining time
+             elapsed_time = time.time() - start_time
+             avg_time_per_image = elapsed_time / (idx + 1)
+             remaining_images = total_images - (idx + 1)
+             estimated_remaining_time = remaining_images * avg_time_per_image
+
+             print(f" | Remaining time: {estimated_remaining_time//60:.0f}m {estimated_remaining_time%60:.0f}s")
+
+         except Exception as e:
+             print(f"\nError processing {filename}: {e}")
+
+     # Save embeddings and paths
+     with open(output_path, 'wb') as f:
+         pickle.dump({'embeddings': embeddings, 'image_paths': image_paths}, f)
+
+     total_time = time.time() - start_time
+     print("\nProcessing complete!")
+     print(f"Total time taken: {total_time//60:.0f} minutes and {total_time%60:.0f} seconds")
+     print(f"Successfully processed {len(embeddings)}/{total_images} images")
+     print(f"Embeddings saved to {output_path}")
+
+     return embeddings, image_paths
+
+ if __name__ == "__main__":
+     precompute_embeddings()
src/similarity_search.py CHANGED
@@ -1,24 +1,22 @@
- import faiss
- import numpy as np
- import pickle
- import os
-
- class SimilaritySearchEngine:
-     def __init__(self):
-         base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
-         embeddings_path = os.path.join(base_dir, 'data', 'embeddings.pkl')
-
-         with open(embeddings_path, 'rb') as f:
-             data = pickle.load(f)
-         self.embeddings = data['embeddings']
-         # Convert Windows paths to Linux paths
-         self.image_paths = [os.path.normpath(path).replace('\\', '/')
-                             for path in data['image_paths']]
-
-         dimension = len(self.embeddings[0])
-         self.index = faiss.IndexFlatL2(dimension)
-         self.index.add(np.array(self.embeddings))
-
-     def search_similar_images(self, query_embedding, top_k=5):
-         distances, indices = self.index.search(np.array([query_embedding]), top_k)
          return [self.image_paths[idx] for idx in indices[0]], distances[0]

+ import faiss
+ import numpy as np
+ import pickle
+ import os
+
+ class SimilaritySearchEngine:
+     def __init__(self, embeddings_path='data/embeddings.pkl'):
+         # Load precomputed embeddings
+         with open(embeddings_path, 'rb') as f:
+             data = pickle.load(f)
+         self.embeddings = data['embeddings']
+         self.image_paths = data['image_paths']
+
+         # Create FAISS index
+         dimension = len(self.embeddings[0])
+         self.index = faiss.IndexFlatL2(dimension)
+         self.index.add(np.array(self.embeddings))
+
+     def search_similar_images(self, query_embedding, top_k=5):
+         # Perform similarity search
+         distances, indices = self.index.search(np.array([query_embedding]), top_k)
          return [self.image_paths[idx] for idx in indices[0]], distances[0]
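Taken together, the refactored modules compose straightforwardly; a minimal sketch assuming the default paths and that embeddings have already been generated (the query image path below is a placeholder, not a file in the repository):

```
from src.feature_extractor import FeatureExtractor
from src.similarity_search import SimilaritySearchEngine

# Assumes data/embeddings.pkl exists; run `python -m src.preprocessing` first.
extractor = FeatureExtractor()
engine = SimilaritySearchEngine()  # loads data/embeddings.pkl by default

# Placeholder path: any JPG/PNG on disk works as a query.
query_embedding = extractor.extract_features('data/sample-test-images/query.jpg')
paths, distances = engine.search_similar_images(query_embedding, top_k=5)
for path, dist in zip(paths, distances):
    print(f'{dist:.2f}  {path}')
```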