Spaces: billyxx/Sprouts_Assignment


billyxx committed 14 days ago

Commit d5c1c41 (verified) · 1 parent: b8538a6

Upload 3 files

Files changed (3)

README.md +20 -3
app.py +17 -1
requirements.txt +1 -0

README.md CHANGED

@@ -18,7 +18,7 @@ This Candidate Recommendation Engine ranks and summarizes resumes against a give
 The application:
 - Accepts a **job description** (text input).
-- Accepts **multiple resumes** (PDF, DOCX, or TXT files).
 - Extracts and cleans resume text.
 - Generates semantic embeddings using **sentence-transformers**.
 - Calculates **cosine similarity** between each resume and the job description.
@@ -33,6 +33,24 @@ The application:
 ## 🛠 Approach
 - **Text Extraction**
   - PDF resumes → parsed using PyPDF2
   - DOCX resumes → parsed using python-docx
@@ -71,8 +89,7 @@ The application:
 ---
 ## ⚠ Limitations
-- Cannot perfectly handle image-based (scanned) resumes without OCR.
 - Candidate name extraction may fail if resumes have unconventional formatting.
 - LLM summaries depend on model capability — may occasionally be generic.
 - Cosine similarity does not account for specific skill weights (all terms treated equally).

 The application:
 - Accepts a **job description** (text input).
+- Accepts **multiple resumes** (PDF or TXT files).
 - Extracts and cleans resume text.
 - Generates semantic embeddings using **sentence-transformers**.
 - Calculates **cosine similarity** between each resume and the job description.
 ## 🛠 Approach
+## AI Summarization
+I chose **not to use the GPT API or Gemini API** for the AI-powered candidate summary because
+I wanted to explore alternatives and deepen my understanding of
+open-source large language models (LLMs).
+After experimenting with several models, I settled on **MBZUAI/LaMini-Flan-T5-248M**
+for generating candidate summaries. This model offers a good balance of performance
+and efficiency for summarization tasks, and working with it has helped me learn more about
+LLMs outside of the popular API-based services.
+## Embeddings
+- To generate semantic embeddings for measuring resume-to-job-description similarity,
+I used **all-mpnet-base-v2** from the sentence-transformers library. In my tests, this model
+produced better cosine similarity results than the other embedding models I tried,
+making the candidate ranking more accurate and relevant.
 - **Text Extraction**
   - PDF resumes → parsed using PyPDF2
   - DOCX resumes → parsed using python-docx
 ---
 ## ⚠ Limitations
 - Candidate name extraction may fail if resumes have unconventional formatting.
 - LLM summaries depend on model capability — may occasionally be generic.
 - Cosine similarity does not account for specific skill weights (all terms treated equally).
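
The ranking step the README describes (embed the job description and each resume, then sort by cosine similarity) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the Space's `recommender.rank_resumes`, whose implementation is not part of this commit. The names `cosine_similarity` and `rank_by_similarity` and the toy 3-dimensional vectors are hypothetical stand-ins for real all-mpnet-base-v2 embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_a * norm_b)


def rank_by_similarity(job_vector, resume_vectors):
    """Return (name, score) pairs sorted from most to least similar.

    resume_vectors maps a candidate name to that resume's embedding.
    """
    scored = [
        (name, cosine_similarity(job_vector, vec))
        for name, vec in resume_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real sentence embeddings (hypothetical data).
job = [1.0, 0.0, 1.0]
resumes = {
    "alice.pdf": [1.0, 0.0, 1.0],   # same direction as the job vector
    "bob.txt": [0.0, 1.0, 0.0],     # orthogonal to the job vector
    "carol.docx": [1.0, 1.0, 1.0],  # partially aligned
}
ranking = rank_by_similarity(job, resumes)
```

Because the score depends only on the direction of each vector, every dimension counts equally, which is consistent with the limitation listed above that specific skills are not weighted.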

app.py CHANGED

@@ -2,6 +2,8 @@ import gradio as gr
 import os
 import pdfplumber
 from recommender import rank_resumes, summarize_resume_flan, extract_applicant_name
 UPLOAD_FOLDER = "uploads"
 os.makedirs(UPLOAD_FOLDER, exist_ok=True)
@@ -26,6 +28,8 @@ def process_resumes(job_description, uploaded_file):
         with pdfplumber.open(filepath) as pdf:
             pages = [page.extract_text() for page in pdf.pages if page.extract_text() is not None]
             text = "\n".join(pages)
     else:
         return "Unsupported file format.", None
@@ -34,6 +38,9 @@ def process_resumes(job_description, uploaded_file):
     # Rank resumes
     results = rank_resumes(job_description, resume_texts)
     # Generate summaries
     for candidate in results:
         candidate["summary"] = summarize_resume_flan(candidate["text"], job_description)
@@ -50,6 +57,14 @@ def process_resumes(job_description, uploaded_file):
     return "", table_data
@@ -58,7 +73,8 @@ with gr.Blocks() as demo:
     with gr.Row():
         job_desc = gr.Textbox(label="Job Description", lines=10, placeholder="Paste job description here...")
-    resumes = gr.File(label="Upload Resume (.txt or .pdf)", file_types=[".txt", ".pdf"])
     btn = gr.Button("Rank Candidates")

 import os
 import pdfplumber
 from recommender import rank_resumes, summarize_resume_flan, extract_applicant_name
+from docx import Document
 UPLOAD_FOLDER = "uploads"
 os.makedirs(UPLOAD_FOLDER, exist_ok=True)
         with pdfplumber.open(filepath) as pdf:
             pages = [page.extract_text() for page in pdf.pages if page.extract_text() is not None]
             text = "\n".join(pages)
+    elif filepath.endswith(".docx"):
+        text = extract_text_from_docx(filepath)
     else:
         return "Unsupported file format.", None
     # Rank resumes
     results = rank_resumes(job_description, resume_texts)
+    for i, candidate in enumerate(results):
+        candidate["name"] = resume_texts[i][0]
     # Generate summaries
     for candidate in results:
         candidate["summary"] = summarize_resume_flan(candidate["text"], job_description)
     return "", table_data
+def extract_text_from_docx(filepath):
+    doc = Document(filepath)
+    full_text = []
+    for para in doc.paragraphs:
+        full_text.append(para.text)
+    return "\n".join(full_text)
     with gr.Row():
         job_desc = gr.Textbox(label="Job Description", lines=10, placeholder="Paste job description here...")
+    resumes = gr.File(label="Upload Resume (.txt, .pdf, .docx)", file_types=[".txt", ".pdf", ".docx"])
     btn = gr.Button("Rank Candidates")
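
The added loop assigns `candidate["name"] = resume_texts[i][0]`, which indexes the upload-order list by position in `results`. If `rank_resumes` returns candidates sorted by score (an assumption, since `recommender.py` is not part of this commit), position `i` in `results` may no longer match position `i` in `resume_texts`, and names could attach to the wrong resume. One possible way to avoid that is to keep the name with each record before sorting. In the sketch below, `rank_with_names` and the word-overlap scorer `word_overlap` are hypothetical placeholders, not the Space's code.

```python
def rank_with_names(job_description, resume_texts, score_fn):
    """Score (name, text) pairs and sort them, keeping each name with its text.

    resume_texts is assumed to be a list of (name, text) tuples, as the
    indexing resume_texts[i][0] in the diff suggests.
    """
    results = [
        {"name": name, "text": text, "score": score_fn(job_description, text)}
        for name, text in resume_texts
    ]
    # Sorting whole records means a name cannot drift away from its text.
    results.sort(key=lambda c: c["score"], reverse=True)
    return results


def word_overlap(job_description, text):
    """Placeholder scorer: fraction of job-description words found in the text."""
    job_words = set(job_description.lower().split())
    if not job_words:
        return 0.0
    return len(job_words & set(text.lower().split())) / len(job_words)


# Hypothetical uploads, listed in upload order rather than rank order.
uploads = [
    ("first.txt", "Java developer"),
    ("second.txt", "Python developer with NLP experience"),
]
ranked = rank_with_names("python nlp developer", uploads, word_overlap)
```

Here the second upload ranks first, and its name travels with it because the record is built before the sort rather than looked up by index afterwards.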

requirements.txt CHANGED

@@ -6,4 +6,5 @@ transformers
 accelerate
 torch
 pdfplumber

 accelerate
 torch
 pdfplumber
+python-docx