## Dataset

The data is retrieved from [Arbetsförmedlingen's (the Swedish Public Employment Service) API](https://jobstream.api.jobtechdev.se/). It gives access to all job listings published in their job listings bank, including real-time information about changes to these listings, such as new publications, deletions, and updates to job descriptions.
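
As an illustration, pulling a snapshot of all current listings boils down to a single GET request. This is a minimal sketch, assuming the endpoint returns a JSON array of ad objects and is openly accessible:

```python
# Minimal sketch of fetching a snapshot of all current job listings.
# Assumes the endpoint returns a JSON array of ad objects; depending on
# the API's access rules, an API key header may also be required.
import requests

response = requests.get(
    "https://jobstream.api.jobtechdev.se/snapshot",
    headers={"Accept": "application/json"},
)
response.raise_for_status()
ads = response.json()
print(f"Retrieved {len(ads)} job listings")
```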

## Method

The code is fairly simple thanks to the tools we have used.

1. The first thing to do is to run `bootstrap.py`. This is done only once (in the beginning) to initialize the Pinecone database and load all ads into it. The program calls the `get_all_ads` method in `get_ads.py`, which in turn calls the snapshot endpoint `https://jobstream.api.jobtechdev.se/snapshot` to get a snapshot of all job listings that are up at the current time.
2. When all ads have been retrieved, we insert them into the Pinecone vector database. This is done through the `upsert_ads` method in `pinecone_handler.py`, which calls `_create_embedding` and `_prepare_metadata` to create the embeddings and metadata, respectively.
3. The `_create_embedding` function takes an ad as input, reads the JSON values for the headline, occupation, and description keys, and combines these three into a single text. It then encodes the text with a SentenceTransformer. We chose [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), which maps sentences and paragraphs to a 384-dimensional dense vector space. It is fine-tuned from [nreimers/MiniLM-L6-H384-uncased](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) with a contrastive objective: given one sentence from a pair, the model should predict which sentence, out of a set of randomly sampled other sentences, was actually paired with it in the dataset. It is intended to be used as a sentence and short-paragraph encoder. (See the sketch after this list for how steps 3–5 fit together.)
4. The `_prepare_metadata` function extracts metadata from the ad, which is stored together with the vector embedding in the Pinecone vector database. Since some JSON values, such as email and municipality, are nested, we had to parse them in a nested manner.
5. When 100 ads (our batch size for insertion) have been vectorized and had their metadata extracted, we upsert them all to the Pinecone vector database through the `_batch_upsert` function.
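
To make steps 3–5 concrete, here is a minimal sketch of how the pieces could fit together. The function names follow the description above, but the exact JSON structure of an ad, the nesting of its fields, and the index name are assumptions, not the project's actual code:

```python
# Hypothetical sketch of the embed-and-upsert flow from steps 3-5.
# The ad field names/nesting and the index name are assumptions.
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
index = Pinecone(api_key="YOUR_PINECONE_API_KEY").Index("job-ads")

BATCH_SIZE = 100  # the insertion batch size mentioned in step 5


def create_embedding(ad: dict) -> list[float]:
    # Combine headline, occupation, and description into one text, then encode it.
    text = " ".join([
        ad.get("headline", ""),
        (ad.get("occupation") or {}).get("label", ""),   # assumed nesting
        (ad.get("description") or {}).get("text", ""),   # assumed nesting
    ])
    return model.encode(text).tolist()


def prepare_metadata(ad: dict) -> dict:
    # Extract a few (possibly nested) fields to store next to the vector.
    return {
        "headline": ad.get("headline", ""),
        "email": (ad.get("application_details") or {}).get("email") or "",
        "municipality": (ad.get("workplace_address") or {}).get("municipality") or "",
    }


def upsert_ads(ads: list[dict]) -> None:
    # Collect (id, vector, metadata) tuples and flush every BATCH_SIZE ads.
    batch = []
    for ad in ads:
        batch.append((str(ad["id"]), create_embedding(ad), prepare_metadata(ad)))
        if len(batch) == BATCH_SIZE:
            index.upsert(vectors=batch)
            batch = []
    if batch:  # flush the final partial batch
        index.upsert(vectors=batch)
```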

### Querying from the vector database

Querying the Pinecone vector database is simple and fast thanks to the Pinecone API. When a resume is uploaded on the frontend (a Streamlit app), the app calls `search_similar_ads` on the `PineconeHandler`, encoding the resume text with the same [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) SentenceTransformer that the job listings were encoded with. It then queries the Pinecone vector database for the most similar vector embeddings and returns the `top_k` (default 5) most similar job listings along with their metadata, which are displayed to the user together with their similarity scores.
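
A minimal sketch of what that query could look like; `search_similar_ads` exists in the project's `PineconeHandler`, but the body below is an assumption based on the description above:

```python
# Hypothetical sketch of the resume query described above.
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
index = Pinecone(api_key="YOUR_PINECONE_API_KEY").Index("job-ads")  # assumed name


def search_similar_ads(resume_text: str, top_k: int = 5) -> list[dict]:
    # Encode the resume with the same model used for the job listings,
    # then fetch the top_k nearest vectors together with their metadata.
    embedding = model.encode(resume_text).tolist()
    result = index.query(vector=embedding, top_k=top_k, include_metadata=True)
    return [
        {"score": match["score"], **match["metadata"]}
        for match in result["matches"]
    ]
```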

## How to run the code

1. Clone the GitHub repository to your local machine.
2. Navigate to the cloned repository folder in your terminal and run `pip install -r requirements.txt`.
3. Sign up for an account at [Pinecone](https://www.pinecone.io/) and create an API key.
4. Save the API key as a GitHub Actions secret named `PINECONE_API_KEY`.
5. Run `python bootstrap.py`. This may take a while since all job listings have to be retrieved from the API and then vectorized and stored in the vector database.
6. To update the vector database, run `python main.py`. This should preferably be scheduled, e.g. with a GitHub Actions workflow (see the sketch below).
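
For example, a scheduled workflow along these lines could run the update. This is a sketch, not the project's actual workflow file; the schedule and Python version are placeholders:

```yaml
# .github/workflows/update.yml -- hypothetical sketch of a scheduled update
name: Update vector database
on:
  schedule:
    - cron: "0 * * * *"  # hourly; adjust to taste
  workflow_dispatch: {}   # allow manual runs
jobs:
  update:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python main.py
        env:
          PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
```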

## Potential improvements

1. [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) truncates input text longer than 256 word pieces. To capture all the semantics of a job listing, we probably need a sentence transformer that can embed longer input texts.
2. Users should be able to filter on municipality or location, because the current app ignores where the person wants to work (often not explicitly stated in their resume), which makes many of the returned job listings irrelevant (see the sketch below).
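
For the second point, Pinecone's metadata filtering could support such a location filter, assuming the municipality is stored as metadata by `_prepare_metadata`. A hypothetical sketch:

```python
# Hypothetical sketch: restricting results to one municipality via
# Pinecone metadata filtering (assumes a "municipality" metadata field).
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
index = Pinecone(api_key="YOUR_PINECONE_API_KEY").Index("job-ads")  # assumed name


def search_ads_in_municipality(resume_text: str, municipality: str, top_k: int = 5):
    # Same query as before, but only over ads whose metadata matches the
    # requested municipality.
    return index.query(
        vector=model.encode(resume_text).tolist(),
        top_k=top_k,
        include_metadata=True,
        filter={"municipality": {"$eq": municipality}},
    )
```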