corrected tagging pipeline + updated README

Changed files:
- .github/workflows/training.yml (+4 -4)
- .github/workflows/visualization.yml (+13 -0)
- README.md (+23 -29)
- tag-posting.py (+26 -10)
- tags/03-01-2025/1.txt (+26 -11)
- tags/03-01-2025/2.txt (+10 -2)
- tags/03-01-2025/3.txt (+14 -2)
.github/workflows/training.yml

```diff
@@ -33,14 +33,14 @@ jobs:
         python llm-tagging.py
         python filter-faults.py
         python train.py
-    - name: List
-      run: ls -R
+    - name: List data folder
+      run: ls -R data || echo "data folder not found"
     - name: Commit and Push Changes
       run: |
         git config --global user.name "github-actions[bot]"
         git config --global user.email "github-actions[bot]@users.noreply.github.com"
-        git add
-        git commit -m "
+        git add data
+        git commit -m "LLM-generated tags uploaded"
         git push
       env:
         GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```
.github/workflows/visualization.yml

```diff
@@ -27,6 +27,7 @@ jobs:
 
     - name: Run Visualization Script
       run: |
+        python tag-posting.py
         python embedding_gen.py
     - name: List plots folder
       run: ls -R plots || echo "plots not found"
@@ -34,9 +35,21 @@ jobs:
       run: |
         git config --global user.name "github-actions[bot]"
         git config --global user.email "github-actions[bot]@users.noreply.github.com"
+        git add tags
         git add plots
         git commit -m "Add plots generated by script"
         git push
       env:
         GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
 
+  sync-to-hub:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+        with:
+          fetch-depth: 0
+          lfs: true
+      - name: Push to hub
+        env:
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
+        run: git push https://Robzy:[email protected]/spaces/Robzy/jobbert_knowledge_extraction main
```
README.md

```diff
@@ -17,37 +17,39 @@ This projects aims to monitor in-demand skills for machine learning roles. Skill
 
 
 
-### [Monitoring Platform Link](https://huggingface.co/spaces/
+### [Monitoring Platform Link](https://huggingface.co/spaces/Robzy/jobbert_knowledge_extraction)
 
 ## Architecture & Frameworks
 
-
-- **
-- **
-- **
-- **
-- ** Weight & Biases **
-- ** Rapid API **
-- ** OpenAI API **
+- **Hugging Face Spaces**: Used as a UI to host interactive visualisation of skill embeddings and their clusters.
+- **GitHub Actions**: Used to schedule training, inference, and visualisation-updating scripts.
+- **Rapid API**: The API used to scrape job descriptions from LinkedIn.
+- **Weights & Biases**: Used for model-training monitoring as well as model storage.
+- **OpenAI API**: Used to extract the ground truth from job descriptions by leveraging multi-shot learning and prompt engineering.
 
 
 # High-Level Overview
 
-##
-
-
-
-
+## Models
+* **BERT** - finetuned skill extraction model, lightweight.
+* **LLM** - gpt-4o for skill extraction with multi-shot learning. Computationally expensive.
+* **Embedding model** - [SentenceTransformers](https://sbert.net/) used to embed skills into vectors.
+* [**spaCy**](https://spacy.io/models/en#en_core_web_sm) - sentence tokenization model.
 
-##
-Continual training, extract ground truth via LLM with multi-shot learning with examples.
-
-
-Save all skills. Make a comprehensive overview by:
-
-1.
-
-2.
+## Pipeline
+
+The following scripts are scheduled to automate the skill monitoring and model training processes continually.
+
+### 1. Job-posting scraping
+Job descriptions for machine learning roles are fetched from LinkedIn via Rapid API.
+### 2. Skills tagging with LLM
+We extract the ground truth of skills from the job descriptions by leveraging multi-shot learning and prompt engineering.
+### 3. Model training
+The skill extraction model is finetuned against the extracted ground truth.
+### 4. Skills tagging with JobBERT
+Skills are extracted from job postings with the finetuned model.
+### 5. Embedding & visualization
+Extracted skills are embedded, reduced, and clustered with an embedding model, UMAP, and K-means respectively.
 
 
 # Job Scraping
@@ -94,11 +96,3 @@ We generate embeddings for technical skills listed in .txt files and visualizes
 - **3D Projection**: Saved as interactive HTML files in the `./plots` folder.
 - **3D Clustering Visualization**: Saved as HTML files, showing clusters with different colors.
 
-# Scheduling
-
-To monitor the in-demand skills and update our model continuously, scheduling is employed. The following scripts are scheduled every Sunday:
-
-1. Job-posting scraping: fetching job descriptions for machine learning from LinkedIn
-2. Skills tagging with LLM: we extract the ground truth of skills from the job descriptions by leveraging multi-shot learning and prompt engineering.
-3. Training
-4. Embedding and visualization - skills are embedded and visualized with KMeans clustering
```
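The tag files touched in this commit store one extracted skill per line under `tags/<MM-DD-YYYY>/`. A minimal sketch of the aggregation an "in-demand skills" overview implies, assuming that one-skill-per-line layout (the file contents are inlined here so the sketch runs standalone):

```python
from collections import Counter

# Stand-in for the lines of tags/<date>/*.txt files: one skill per line.
tagged_skills = ["Python", "MLOps", "Python", "Kubernetes", "MLOps", "Python"]

# Count how often each skill appears across tagged postings --
# the raw signal behind a demand overview.
demand = Counter(tagged_skills)
top = demand.most_common(2)  # [('Python', 3), ('MLOps', 2)]
```

In the real pipeline the list would be built by reading every `.txt` file in the dated tag directories before counting.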
tag-posting.py

```diff
@@ -4,6 +4,7 @@ from transformers import AutoTokenizer, BertForTokenClassification, TrainingArgu
 import torch
 from typing import List
 import os
+from datetime import datetime
 
 
 ### Parsing job posting
@@ -215,18 +216,33 @@ def backfill():
 
         print(f"Saved skills to: {tag_path}")
 
-def tag_date():
+def tag_date(date):
 
-
+    tag_dir = os.path.join(os.getcwd(), 'tags', date)
+    job_dir = os.path.join(os.getcwd(), 'job-postings', date)
 
-
+    for job in os.listdir(job_dir):
+
+        job_path = os.path.join(job_dir, job)
+        tag_path = os.path.join(tag_dir, job)
+
+        print(f"Processing job file: {job_path}")
 
-
-
+        if not os.path.exists(tag_dir):
+            os.makedirs(tag_dir)
+            print(f"Created directory: {tag_dir}")
+
+        sents = parse_post(job_path)
+        skills = extract_skills(sents)
+        skills_save(tag_path, skills)
+
+        print(f"Saved skills to: {tag_path}")
 
 
-#
-
-
-# skills_save('./tags/03-01-2024/2.txt',skills)RAPID_API_KEY : 60a10b11e6msh821d32f6e1e955ep15b5b1jsnf61a46680409
-1
+if __name__ == '__main__':
+
+    # Backfill all job postings
+    # backfill()
+
+    # Tag today's job postings
+    date = datetime.today().strftime('%m-%d-%Y')
+    tag_date(date)
```
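The new `tag_date` entry point keys its directories on the date string built in `__main__`. A quick illustration of that stdlib formatting, which matches the `tags/03-01-2025/` folder names in this commit (a fixed date is used so the output is reproducible):

```python
from datetime import datetime

# '%m-%d-%Y' yields the MM-DD-YYYY form used for tags/<date>/ directories.
stamp = datetime(2025, 3, 1).strftime('%m-%d-%Y')
print(stamp)  # 03-01-2025
```

Note that MM-DD-YYYY strings do not sort chronologically as plain text, so any later listing of dated folders would need to parse them back with `datetime.strptime`.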
tags/03-01-2025/1.txt

```diff
@@ -1,33 +1,48 @@
+Ericsson
+mobile
+Ericsson
+Ericsson
+Networked Society
+Eric
+6
+6G
+regulation
+standardization
+6th
+cloud
+cloud
+MLOps
 ML
-
-
+AI
+cloud
+6G
+standard
 MSc in Data Science
 Python
 Go
 MLOps
 MLFlow
-Kubeflow
+Kubeflow
+Python
 Hydra
 numpy
 TensorFlow
-
+Dev
 CI
-/
 CD
-
-pipeline
-testing
+deployment
+pipeline
 ML
 ML
 PyTorch
 TensorFlow
-
-
+Jax
+Con
 Docker
 Kaniko
 Kubernetes
 Helm
-Cloud
+Cloud
 AWS
 Infrastructure management
 Ansible
```
tags/03-01-2025/2.txt

```diff
@@ -3,11 +3,19 @@ Automation
 data analysis
 image recognition
 automation
+Transformers
 Artificial Intelligence
 feasibility studies
+AI
+industry
+.
+operational
 data analysis
 Data Science
-degree
+degree
+software engineering
 Artificial Intelligence
 Vision Systems
-
+project
+English
+Con
```
tags/03-01-2025/3.txt

```diff
@@ -1,10 +1,20 @@
+data
+web
 SQL
 cloud infrastructure
 APIs
+data
+Market
 Python
 infra
 database
-
+scraping
+Python
+cloud
+APIs
+Typescript
+node
+anals
 SaaS
 agile development
 sprint planning
@@ -19,4 +29,6 @@ cloud environments
 Azure
 data processing
 Databricks
-English
+English
+T
+contract
```