Robzy committed
Commit 113c4ac · 1 Parent(s): e5096ab

corrected tagging pipeline + updated README
.github/workflows/training.yml CHANGED
```diff
@@ -33,14 +33,14 @@ jobs:
         python llm-tagging.py
         python filter-faults.py
         python train.py
-    - name: List tags folder
-      run: ls -R tags || echo "tags folder not found"
+    - name: List data folder
+      run: ls -R data || echo "data folder not found"
     - name: Commit and Push Changes
       run: |
         git config --global user.name "github-actions[bot]"
         git config --global user.email "github-actions[bot]@users.noreply.github.com"
-        git add tags
-        git commit -m "Add tags generated by script"
+        git add data
+        git commit -m "LLM-generated tags uploaded"
         git push
       env:
         GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```
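The tagging step this workflow runs, `llm-tagging.py`, is not included in the commit. A minimal sketch of the multi-shot extraction it presumably performs, using the OpenAI chat API with the `gpt-4o` model named in the README; the system prompt and the worked example pair are illustrative assumptions, not code from the repo:

```python
# Hypothetical sketch of the LLM tagging step (llm-tagging.py is not shown in
# this commit); prompt wording and the example pair are assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def tag_with_llm(job_description: str) -> list[str]:
    """Extract skills from a job ad via multi-shot prompting."""
    messages = [
        {"role": "system",
         "content": "Extract the technical skills mentioned in the job ad. "
                    "Return one skill per line, nothing else."},
        # Multi-shot: worked examples steer the model toward the output format.
        {"role": "user",
         "content": "We want an ML engineer with PyTorch and Kubernetes experience."},
        {"role": "assistant", "content": "PyTorch\nKubernetes"},
        {"role": "user", "content": job_description},
    ]
    response = client.chat.completions.create(
        model="gpt-4o", messages=messages, temperature=0)
    return response.choices[0].message.content.splitlines()
```

Pinning `temperature=0` keeps the extracted tag lists reproducible across scheduled runs, which matters when the output is committed to the repo as ground truth.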
.github/workflows/visualization.yml CHANGED
```diff
@@ -27,6 +27,7 @@ jobs:
 
     - name: Run Visualization Script
       run: |
+        python tag-posting.py
         python embedding_gen.py
     - name: List plots folder
       run: ls -R plots || echo "plots not found"
@@ -34,9 +35,21 @@
       run: |
         git config --global user.name "github-actions[bot]"
         git config --global user.email "github-actions[bot]@users.noreply.github.com"
+        git add tags
         git add plots
         git commit -m "Add plots generated by script"
         git push
       env:
         GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
 
+  sync-to-hub:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+        with:
+          fetch-depth: 0
+          lfs: true
+      - name: Push to hub
+        env:
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
+        run: git push https://Robzy:$HF_TOKEN@huggingface.co/spaces/Robzy/jobbert_knowledge_extraction main
```
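The updated workflow runs `tag-posting.py` before `embedding_gen.py`, so freshly tagged skills feed directly into the plots. `embedding_gen.py` itself is not part of this commit; the sketch below shows, under assumptions, the embed-reduce-cluster-plot sequence the README describes (SentenceTransformers embeddings, UMAP 3D reduction, K-means clusters, HTML output under `./plots`). The checkpoint name, cluster count, and file layout are illustrative:

```python
# Hypothetical sketch of the embed-reduce-cluster-plot step in embedding_gen.py,
# which is not shown in this commit; model name, n_clusters and paths are assumptions.
import glob
import os

import plotly.express as px
import umap
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Collect one skill per line from every tag file (layout written by tag-posting.py).
skills = []
for path in glob.glob("tags/*/*.txt"):
    with open(path) as f:
        skills += [line.strip() for line in f if line.strip()]

# Embed skills into vectors with SentenceTransformers.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
embeddings = model.encode(skills)

# Reduce to 3D with UMAP for plotting; cluster with K-means as the README describes.
coords = umap.UMAP(n_components=3, random_state=42).fit_transform(embeddings)
labels = KMeans(n_clusters=8, n_init=10, random_state=42).fit_predict(embeddings)

# Save the interactive 3D clustering visualization as HTML under ./plots.
os.makedirs("plots", exist_ok=True)
fig = px.scatter_3d(x=coords[:, 0], y=coords[:, 1], z=coords[:, 2],
                    color=[str(l) for l in labels], hover_name=skills)
fig.write_html("plots/skill_clusters.html")
```

Clustering on the full-dimensional embeddings rather than the reduced UMAP coordinates keeps the cluster assignments independent of the projection used for display.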
README.md CHANGED
```diff
@@ -17,37 +17,39 @@ This projects aims to monitor in-demand skills for machine learning roles. Skill
 
 ![Header Image](header.png)
 
-### [Monitoring Platform Link](https://huggingface.co/spaces/jjzha/skill_extraction_demo)
+### [Monitoring Platform Link](https://huggingface.co/spaces/Robzy/jobbert_knowledge_extraction)
 
 ## Architecture & Frameworks
 
-
-- ** Hugging Face Spaces **
-- ** Gradio **
-- ** GitHub Actions **
-- ** Rapid API **
-- ** Weight & Biases **
-- ** Rapid API **
-- ** OpenAI API **
+- **Hugging Face Spaces**: Used as a UI to host interactive visualisations of skill embeddings and their clusters.
+- **GitHub Actions**: Used to schedule the training, inference and visualisation-updating scripts.
+- **Rapid API**: The API used to scrape job descriptions from LinkedIn.
+- **Weights & Biases**: Used for monitoring model training as well as for storing models.
+- **OpenAI API**: Used to extract ground truth from job descriptions by leveraging multi-shot learning and prompt engineering.
 
 
 # High-Level Overview
 
-## Model: skills extraction model
-
-## Inference
-1. Extracting new job abs from Indeed/LinkedIn
-2. Extract skills from job ads via skills extraction model
+## Models
+* **BERT** - finetuned skill extraction model, lightweight.
+* **LLM** - gpt-4o for skill extraction with multi-shot learning. Computationally expensive.
+* **Embedding model** - [SentenceTransformers](https://sbert.net/) used to embed skills into vectors.
+* [**spaCy**](https://spacy.io/models/en#en_core_web_sm) - sentence tokenization model.
 
-## Online training
-Continual training, extract ground truth via LLM with multi-shot learning with examples.
+## Pipeline
 
-## Skill compilation
-Save all skills. Make a comprehensive overview by:
+The following scripts are scheduled to automate the skill-monitoring and model-training processes continually.
 
-1. Embed skills to a vector with an embedding model
-2. Perform clustering with KMeans
-2. Visualize clustering with dimensionality reduction (UMAP)
+### 1. Job-posting scraping
+Fetching job descriptions for machine learning roles from LinkedIn via Rapid API.
+### 2. Skills tagging with LLM
+We extract the ground truth of skills from the job descriptions by leveraging multi-shot learning and prompt engineering.
+### 3. Model training
+The skill extraction model is finetuned with respect to the extracted ground truth.
+### 4. Skills tagging with JobBERT
+Skills are extracted from job postings with the finetuned model.
+### 5. Embedding & visualization
+Extracted skills are embedded, reduced and clustered with an embedding model, UMAP and K-means, respectively.
 
 
 # Job Scraping
@@ -94,11 +96,3 @@ We generate embeddings for technical skills listed in .txt files and visualizes
 - **3D Projection**: Saved as interactive HTML files in the `./plots` folder.
 - **3D Clustering Visualization**: Saved as HTML files, showing clusters with different colors.
 
-# Scheduling
-
-To monitor the in-demand skills and update our model continously, scheduling is employed. The following scripts are scheduled every Sunday:
-
-1. Job-posting scraping: fetching job descriptions for machine learning from LinkedIn
-2. Skills tagging with LLM: we decide to extract the ground truth of skills from the job descriptions by leveraging multi-shot learning and prompt engeneering.
-3. Training
-4. Embedding and visualizatio - skills are embedded and visualized with KMeans clustering
```
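The new README names Weights & Biases for training monitoring and model storage but shows no code for it. A minimal sketch of that pattern with the wandb client; the project and artifact names are assumptions:

```python
# Hypothetical sketch of the W&B usage the README describes (train.py is not
# shown in this commit); project and artifact names are assumptions.
import wandb

run = wandb.init(project="jobbert_knowledge_extraction", job_type="training")

# Log training metrics as they are produced (placeholder value here).
run.log({"train/loss": 0.12})

# Store the finetuned model as a versioned artifact for later pipeline runs.
artifact = wandb.Artifact("jobbert-skill-extraction", type="model")
artifact.add_dir("model/")  # assumed local directory with the saved checkpoint
run.log_artifact(artifact)
run.finish()
```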
tag-posting.py CHANGED
```diff
@@ -4,6 +4,7 @@ from transformers import AutoTokenizer, BertForTokenClassification, TrainingArgu
 import torch
 from typing import List
 import os
+from datetime import datetime
 
 
 ### Parsing job posting
@@ -215,18 +216,33 @@ def backfill():
 
         print(f"Saved skills to: {tag_path}")
 
-def tag_date():
-
-    pass
-
-if __name__ == '__main__':
-
-    # Backfill
-    backfill()
-
-
-    # path = './job-postings/03-01-2024/2.txt'
-    # sents = parse_post(path)
-    # skills = extract_skills(sents)
-    # skills_save('./tags/03-01-2024/2.txt',skills)
+def tag_date(date):
+
+    tag_dir = os.path.join(os.getcwd(), 'tags', date)
+    job_dir = os.path.join(os.getcwd(), 'job-postings', date)
+
+    for job in os.listdir(job_dir):
+
+        job_path = os.path.join(job_dir, job)
+        tag_path = os.path.join(tag_dir, job)
+
+        print(f"Processing job file: {job_path}")
+
+        if not os.path.exists(tag_dir):
+            os.makedirs(tag_dir)
+            print(f"Created directory: {tag_dir}")
+
+        sents = parse_post(job_path)
+        skills = extract_skills(sents)
+        skills_save(tag_path, skills)
+
+        print(f"Saved skills to: {tag_path}")
+
+if __name__ == '__main__':
+
+    # Backfill all job postings
+    # backfill()
+
+    # Tag today's job postings
+    date = datetime.today().strftime('%m-%d-%Y')
+    tag_date(date)
```
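One caveat in the new entry point: `tag_date()` calls `os.listdir(job_dir)` without checking that the folder exists, so on a date when the scraping workflow produced no postings the script exits with `FileNotFoundError`. A hypothetical guard, not part of the commit, could look like this:

```python
# Hypothetical wrapper around the new tag_date(); not part of this commit.
# os.listdir() inside tag_date raises FileNotFoundError when the scraper has
# not produced a job-postings folder for the given date.
import os
from datetime import datetime


def tag_date_if_present(date: str) -> None:
    job_dir = os.path.join(os.getcwd(), 'job-postings', date)
    if not os.path.isdir(job_dir):
        print(f"No job postings found for {date}; skipping tagging.")
        return
    tag_date(date)  # assumes tag_date from tag-posting.py is defined/importable


if __name__ == '__main__':
    tag_date_if_present(datetime.today().strftime('%m-%d-%Y'))
```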
 
 
tags/03-01-2025/1.txt CHANGED
```diff
@@ -1,33 +1,48 @@
+Ericsson
+mobile
+Ericsson
+Ericsson
+Networked Society
+Eric
+6
+6G
+regulation
+standardization
+6th
+cloud
+cloud
+MLOps
 ML
--
-AI based R & D
+AI
+cloud
+6G
+standard
 MSc in Data Science
 Python
 Go
 MLOps
 MLFlow
-Kubeflow )
+Kubeflow
+Python
 Hydra
 numpy
 TensorFlow
-DevOps
+Dev
 CI
-/
 CD
-runner deployment & management
-pipeline creation
-testing
+deployment
+pipeline
 ML
 ML
 PyTorch
 TensorFlow
-Containers
-engines, orchestration tools and
+Jax
+Con
 Docker
 Kaniko
 Kubernetes
 Helm
-Cloud ecosystems
+Cloud
 AWS
 Infrastructure management
 Ansible
```
tags/03-01-2025/2.txt CHANGED
```diff
@@ -3,11 +3,19 @@ Automation
 data analysis
 image recognition
 automation
+Transformers
 Artificial Intelligence
 feasibility studies
+AI
+industry
+.
+operational
 data analysis
 Data Science
-degree in software engineering
+degree
+software engineering
 Artificial Intelligence
 Vision Systems
-English
+project
+English
+Con
```
tags/03-01-2025/3.txt CHANGED
```diff
@@ -1,10 +1,20 @@
+data
+web
 SQL
 cloud infrastructure
 APIs
+data
+Market
 Python
 infra
 database
-Types
+scraping
+Python
+cloud
+APIs
+Typescript
+node
+anals
 SaaS
 agile development
 sprint planning
@@ -19,4 +29,6 @@ cloud environments
 Azure
 data processing
 Databricks
-English
+English
+T
+contract
```