forestav committed
Commit 1dc2051 · 2 Parent(s): f51752f 012ba72

Merge branch 'main' of https://github.com/filiporestav/jobsai

Files changed (1):
1. README.md (+24 -13)
README.md CHANGED
@@ -13,8 +13,8 @@ pinned: false
 
 This repository contains the final project for the course **ID2223 Scalable Machine Learning and Deep Learning** at KTH.
 
- The project culminates in an AI-powered job matching platform, **JobsAI**, designed to help users find job listings tailored to their resumes. The application is hosted on Streamlit Community Cloud and can be accessed here:
- [**JobsAI**](https://jobsai.streamlit.app/)
+ The project culminates in an AI-powered job matching platform, **JobsAI**, designed to help users find job listings tailored to their resumes. The application is hosted as a Gradio app on Hugging Face Spaces and can be accessed here:
+ [**JobsAI**](https://huggingface.co/spaces/forestav/jobsai)
 
 ---
 
@@ -49,8 +49,10 @@ The platform uses two primary data sources:
 ### Tool Selection
 
 - **Vector Database**: After evaluating several options, we chose **Pinecone** for its ease of use and targeted support for vector embeddings.
- - **Embedding Model**: We used [**sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2**](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2), a pre-trained transformer model that encodes sentences and paragraphs into a 384-dimensional dense vector space.
+ - **Embedding Model**: The base model is [**sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2**](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2), a pre-trained transformer model that encodes sentences and paragraphs into a 384-dimensional dense vector space.
+ - **Fine-tuned Model**: The base model is fine-tuned on user-provided data every 7 days and stored on Hugging Face. It can be found [**here**](https://huggingface.co/forestav/job_matching_sentence_transformer).
 - **Backend Updates**: GitHub Actions is used to automate daily updates to the vector database.
+ - **Feature Store**: To store user-provided data, we use **Hopsworks**, which allows for easy feature interaction and lets us keep older models so that performance can be evaluated over time.
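For illustration only (this is not code from the repository), a minimal sketch of how this embedding model turns text into the 384-dimensional vectors that end up in Pinecone:

```python
# Minimal sketch: encode text with the base model into 384-dimensional vectors.
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Works for both job ads and resumes; the model is multilingual, so Swedish ads are fine.
texts = [
    "Vi söker en dataingenjör med erfarenhet av Python.",
    "Data engineer with five years of Python experience.",
]
embeddings = model.encode(texts, normalize_embeddings=True)

print(embeddings.shape)  # (2, 384) – one dense vector per input text
```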
 
 ### Workflow
 
@@ -62,6 +64,12 @@ The platform uses two primary data sources:
 2. **Similarity Search**:
    - User-uploaded resumes are vectorized using the same sentence transformer model.
    - Pinecone is queried for the top-k most similar job embeddings, which are then displayed to the user alongside their similarity scores.
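A hedged sketch of this similarity-search step; the index name `jobs` and the `headline` metadata field are illustrative assumptions, not taken from the repository:

```python
# Sketch of the similarity search: embed a resume and ask Pinecone for the
# top-k closest job ads. The index name and metadata fields are assumptions.
import os
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("jobs")  # assumed index name

resume_text = "Backend developer, 5 years of Python, Docker and AWS."
resume_vector = model.encode(resume_text).tolist()

results = index.query(vector=resume_vector, top_k=5, include_metadata=True)
for match in results.matches:
    print(match.id, round(match.score, 3), (match.metadata or {}).get("headline"))
```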
 
 
 
 
 
 
+
+ 3. **Feature Uploading**:
+    - If a user chooses to leave feedback by clicking either *Relevant* or *Not Relevant*, the user's CV is uploaded to Hopsworks together with the specific ad data and the selected choice.
+
+ 4. **Model Training**:
+    - Once every seven days, a cron job on *GitHub Actions* runs, fine-tuning the base model on all of the data stored in the feature store.
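A sketch of how the feedback in step 3 could be written to a Hopsworks feature group; the feature group name and columns (`job_feedback`, `cv_text`, `ad_text`, `label`) are illustrative assumptions:

```python
# Sketch of the feedback-upload step: store (resume, ad, label) rows in a
# Hopsworks feature group. Names and schema below are assumptions.
import os
import pandas as pd
import hopsworks

project = hopsworks.login(api_key_value=os.environ["HOPSWORKS_API_KEY"])
fs = project.get_feature_store()

feedback_fg = fs.get_or_create_feature_group(
    name="job_feedback",          # assumed feature group name
    version=1,
    primary_key=["feedback_id"],
    description="User feedback on recommended job ads",
)

row = pd.DataFrame([{
    "feedback_id": 1,
    "cv_text": "Backend developer, 5 years of Python...",
    "ad_text": "We are looking for a data engineer...",
    "label": 1,                   # 1 = Relevant, 0 = Not Relevant
}])
feedback_fg.insert(row)
```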
 
 ---
 
@@ -81,6 +89,11 @@ The platform uses two primary data sources:
 - **Automated Workflow**: A GitHub Actions workflow runs `main.py` daily at midnight.
 - **Incremental Updates**: The `keep_updated.py` function fetches job listings updated since the last recorded timestamp, ensuring the vector database remains current.
 
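Roughly, the incremental fetch looks like the following sketch; the endpoint and `date` parameter are assumptions based on the public JobStream documentation and may differ from what `keep_updated.py` actually does:

```python
# Sketch of an incremental fetch: ask JobStream for ads changed since a timestamp.
# The endpoint and query parameter below are assumptions from the public
# JobStream docs; the repository's keep_updated.py may differ.
import requests

JOBSTREAM_URL = "https://jobstream.api.jobtechdev.se/stream"
last_run = "2024-01-01T00:00:00"  # normally read from persisted state

response = requests.get(JOBSTREAM_URL, params={"date": last_run}, timeout=60)
response.raise_for_status()

ads = response.json()  # ads created, updated, or removed since `date`
print(f"{len(ads)} ads changed since {last_run}")
```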
 
 
 
 
 
 
+ ### Weekly Updates
+
+ - **Automated Workflow**: A GitHub Actions workflow runs `training_pipeline.ipynb` every Sunday at midnight.
+ - **Model Training**: Features are downloaded from Hopsworks, and the base sentence transformer is fine-tuned on the full dataset of both positive and negative examples.
+
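A condensed sketch of what such a weekly fine-tuning job could look like with the classic sentence-transformers training API; the feature group and column names reuse the illustrative assumptions above, and the actual `training_pipeline.ipynb` may be organized differently:

```python
# Sketch of the weekly fine-tuning job: read feedback from Hopsworks and
# fine-tune the base model on (cv_text, ad_text, label) pairs.
# Feature group and column names are illustrative assumptions.
import os
import hopsworks
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

project = hopsworks.login(api_key_value=os.environ["HOPSWORKS_API_KEY"])
fs = project.get_feature_store()
feedback_df = fs.get_feature_group("job_feedback", version=1).read()

# Positive pairs (label 1) should end up close in the vector space,
# negative pairs (label 0) far apart; CosineSimilarityLoss handles both.
examples = [
    InputExample(texts=[row.cv_text, row.ad_text], label=float(row.label))
    for row in feedback_df.itertuples()
]

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
loader = DataLoader(examples, shuffle=True, batch_size=16)
loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("finetuned_job_matching_model")  # can then be uploaded to the Hugging Face Hub
```

CosineSimilarityLoss is one reasonable choice for binary relevance labels; a contrastive or multiple-negatives loss would also fit this setup.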
 ### Querying for Matches
 
 - When a user uploads their resume:
@@ -96,6 +109,8 @@ The platform uses two primary data sources:
 1. Python 3.x installed locally.
 2. A [Pinecone](https://www.pinecone.io/) account and API key.
 3. Arbetsförmedlingen JobStream API access (free).
+ 4. A [Hopsworks](https://www.hopsworks.ai/) account and API key.
+ 5. A [Hugging Face](https://huggingface.co/) account and API key.
 
 ### Steps
 
@@ -108,15 +123,17 @@ The platform uses two primary data sources:
 ```bash
 pip install -r requirements.txt
 ```
- 3. Add your Pinecone API key as an environment variable:
+ 3. Add your API keys as environment variables:
 ```bash
 export PINECONE_API_KEY=<your-api-key>
+ export HOPSWORKS_API_KEY=<your-api-key>
+ export HUGGINGFACE_API_KEY=<your-api-key>
 ```
 4. Run the application locally:
 ```bash
- streamlit run app.py
+ python app.py
 ```
- 5. Open the Streamlit app in your browser to upload resumes and view job recommendations.
+ 5. Open the Gradio app in your browser to upload resumes and view job recommendations.
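As a quick sanity check that the exported keys are picked up, something along these lines can be run (standard client calls, not code from the repository):

```python
# Sketch: verify the exported API keys are visible and initialize the clients.
import os
import hopsworks
from pinecone import Pinecone
from huggingface_hub import login

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
project = hopsworks.login(api_key_value=os.environ["HOPSWORKS_API_KEY"])
login(token=os.environ["HUGGINGFACE_API_KEY"])

print("Pinecone indexes:", pc.list_indexes().names())
print("Hopsworks project:", project.name)
```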
 
 ## Potential Improvements
 
@@ -125,12 +142,6 @@ The platform uses two primary data sources:
 - The current embedding model truncates text longer than 128 tokens.
 - For longer job descriptions, a model capable of processing more tokens (e.g., 512 or 1024) could improve accuracy.
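For context, the 128-token limit is the SentenceTransformer's default `max_seq_length`, not a hard limit of the underlying transformer (which accepts up to 512 tokens); a minimal sketch:

```python
# Sketch: the 128-token truncation is a setting on the SentenceTransformer,
# not a hard limit of the underlying transformer (which supports up to 512).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
print(model.max_seq_length)   # 128 by default – longer inputs are truncated

model.max_seq_length = 256    # accept longer job descriptions, at the cost of slower encoding
```

Whether longer inputs actually improve retrieval quality would still need to be evaluated, since the model was trained on short paragraphs.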
 
- ### Active Learning
-
- - Adding a feedback loop for users to label jobs as "Relevant" or "Not Relevant" could fine-tune the model.
- - Limitations in Streamlit’s reactivity make it unsuitable for collecting real-time feedback.
- - A future iteration could use **React** for a more seamless UI experience.
-
 ### Scalability
 
 - Embedding and querying currently run on CPU, which may limit performance for larger datasets.
@@ -142,6 +153,6 @@ The platform uses two primary data sources:
 
 **JobsAI** is a proof-of-concept platform that demonstrates how AI can revolutionize the job search experience. By leveraging vector embeddings and similarity search, the platform reduces inefficiencies and matches users with the most relevant job postings.
 
- While it is functional and effective as a prototype, there are ample opportunities for enhancement, particularly in scalability, UI design, and model fine-tuning.
+ While it is functional and effective as a prototype, there are ample opportunities for enhancement, particularly in scalability and model capacity.
 
 For a live demo, visit [**JobsAI**](https://jobsai.streamlit.app/).
 