billyxx committed on
Commit d5c1c41 · verified · 1 Parent(s): b8538a6

Upload 3 files

Files changed (3)
  1. README.md +20 -3
  2. app.py +17 -1
  3. requirements.txt +1 -0
README.md CHANGED
@@ -18,7 +18,7 @@ This Candidate Recommendation Engine ranks and summarizes resumes against a give
  The application:

  - Accepts a **job description** (text input).
- - Accepts **multiple resumes** (PDF, DOCX, or TXT files).
+ - Accepts **multiple resumes** (PDF or TXT files).
  - Extracts and cleans resume text.
  - Generates semantic embeddings using **sentence-transformers**.
  - Calculates **cosine similarity** between each resume and the job description.
@@ -33,6 +33,24 @@ The application:

  ## 🛠 Approach

+ ## AI Summarization
+
+ I chose **not to use GPT API or Gemini API** for the AI-powered candidate summary because
+ I wanted to explore alternative options and deepen my understanding of
+ open-source large language models (LLMs).
+
+ After experimenting with several LLM models, I finalized on using the **MBZUAI/LaMini-Flan-T5-248M**
+ model for generating candidate summaries. This model provides a good balance of performance
+ and efficiency for summarization tasks, and working with it has helped me learn more about
+ LLMs outside of the popular API-based services.
+
+ ## Embeddings
+ - For generating semantic embeddings to measure resume-job description similarity,
+ I used **all-mpnet-base-v2** from the sentence-transformers library. This model
+ provided better cosine similarity results compared to other embedding models I tested,
+ making the ranking of candidates more accurate and relevant.
+
+
  - **Text Extraction**
  - PDF resumes → parsed using PyPDF2
  - DOCX resumes → parsed using python-docx
@@ -71,8 +89,7 @@ The application:
  ---

  ## ⚠ Limitations
-
- - Cannot perfectly handle image-based (scanned) resumes without OCR.
+
  - Candidate name extraction may fail if resumes have unconventional formatting.
  - LLM summaries depend on model capability - may occasionally be generic.
  - Cosine similarity does not account for specific skill weights (all terms treated equally).
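Review note: the README relies on cosine similarity for ranking, and the `app.py` hunk in this commit assigns `candidate["name"]` by list position (`resume_texts[i][0]`). If `rank_resumes` returns candidates sorted by score (its body is not part of this diff, so this is an assumption), a positional lookup could attach the wrong file name to a candidate. A minimal sketch of one way to avoid that is to keep the name attached to each score before sorting. The helper names `cosine_similarity` and `rank_by_similarity` and the toy vectors are hypothetical stand-ins, not code from the repository; real vectors would come from the sentence-transformers model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_a * norm_b)


def rank_by_similarity(job_vector, resumes):
    """Rank (name, vector) pairs against job_vector, best match first.

    The name travels with its score, so sorting cannot separate them.
    """
    scored = [
        {"name": name, "score": cosine_similarity(job_vector, vector)}
        for name, vector in resumes
    ]
    return sorted(scored, key=lambda c: c["score"], reverse=True)


# Toy 3-dimensional vectors standing in for real embeddings (hypothetical data).
job = [1.0, 0.0, 1.0]
resumes = [
    ("alice.pdf", [0.0, 1.0, 0.0]),
    ("bob.txt", [1.0, 0.0, 1.0]),
    ("carol.pdf", [1.0, 1.0, 0.0]),
]
ranked = rank_by_similarity(job, resumes)
```

Here the upload order is alice, bob, carol, while the ranked order differs, which is the situation where index-based name assignment would mislabel rows.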
app.py CHANGED
@@ -2,6 +2,8 @@ import gradio as gr
  import os
  import pdfplumber
  from recommender import rank_resumes, summarize_resume_flan, extract_applicant_name
+ from docx import Document
+

  UPLOAD_FOLDER = "uploads"
  os.makedirs(UPLOAD_FOLDER, exist_ok=True)
@@ -26,6 +28,8 @@ def process_resumes(job_description, uploaded_file):
          with pdfplumber.open(filepath) as pdf:
              pages = [page.extract_text() for page in pdf.pages if page.extract_text() is not None]
              text = "\n".join(pages)
+     elif filepath.endswith(".docx"):
+         text = extract_text_from_docx(filepath)
      else:
          return "Unsupported file format.", None

@@ -34,6 +38,9 @@ def process_resumes(job_description, uploaded_file):
      # Rank resumes
      results = rank_resumes(job_description, resume_texts)

+     for i, candidate in enumerate(results):
+         candidate["name"] = resume_texts[i][0]
+
      # Generate summaries
      for candidate in results:
          candidate["summary"] = summarize_resume_flan(candidate["text"], job_description)
@@ -50,6 +57,14 @@ def process_resumes(job_description, uploaded_file):

      return "", table_data

+ def extract_text_from_docx(filepath):
+     doc = Document(filepath)
+     full_text = []
+     for para in doc.paragraphs:
+         full_text.append(para.text)
+     return "\n".join(full_text)
+
+



@@ -58,7 +73,8 @@ with gr.Blocks() as demo:
      with gr.Row():
          job_desc = gr.Textbox(label="Job Description", lines=10, placeholder="Paste job description here...")

-         resumes = gr.File(label="Upload Resume (.txt or .pdf)", file_types=[".txt", ".pdf"])
+         resumes = gr.File(label="Upload Resume (.txt, .pdf, .docx)", file_types=[".txt", ".pdf", ".docx"])
+
      btn = gr.Button("Rank Candidates")
 
80
 
requirements.txt CHANGED
@@ -6,4 +6,5 @@ transformers
  accelerate
  torch
  pdfplumber
+ python-docx