---
license: apache-2.0
language:
- en
metrics:
- precision
base_model:
- distilbert/distilbert-base-uncased
pipeline_tag: text-classification
tags:
- pytorch
---
# Fake Job Predictor
## Data
1. The training data comes from this Kaggle dataset: https://www.kaggle.com/datasets/shivamb/real-or-fake-fake-jobposting-prediction
2. The original dataset has around 18k samples. To mitigate the class imbalance problem, the majority class (real jobs) was undersampled (see the sketch after this list).
3. The final training dataset has around 4k samples.
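As a rough illustration of the balancing step in item 2, here is a minimal undersampling sketch with pandas. The CSV file name and the `fraudulent` label column (0 = real, 1 = fake) come from the Kaggle dataset's published schema, not from this card, and the exact undersampling ratio used for training is not stated here, so the sketch simply undersamples the majority class down to roughly the ~4k total mentioned above.

```python
import pandas as pd

TARGET_TOTAL = 4_000  # approximate final dataset size stated above

# File name and column names assumed from the Kaggle dataset page.
df = pd.read_csv("fake_job_postings.csv")

fake = df[df["fraudulent"] == 1]  # minority class, kept whole
real = df[df["fraudulent"] == 0]  # majority class, undersampled

n_real = max(TARGET_TOTAL - len(fake), 0)
real_sampled = real.sample(n=n_real, random_state=42)

# Combine and shuffle so the two classes are interleaved.
balanced = (
    pd.concat([fake, real_sampled])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
print(balanced["fraudulent"].value_counts())
```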
## Model
1. Multi-head neural network with one head per text feature of the job posting (description, requirements, and benefits); a sketch follows this list.
2. Best metrics achieved on the validation split: Precision: 0.83, Recall: 0.65, F1-score: 0.71
3. The training code comes from this GitHub repo: https://github.com/sebassaras02/AdvancedDLCourse/blob/master/02_transformers_nlp/bert.ipynb
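Below is a minimal PyTorch sketch of the multi-head layout described in item 1. Only the three text inputs and the DistilBERT backbone come from this card; the class name, the head dimension, the [CLS] pooling, and the choice of a single shared encoder (rather than three separate DistilBERT copies) are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class FakeJobClassifier(nn.Module):
    """Sketch: one shared DistilBERT encoder with one projection head
    per text field (description, requirements, benefits)."""

    def __init__(self, backbone="distilbert/distilbert-base-uncased", head_dim=128):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size  # 768 for distilbert-base
        # One head per input feature (names are illustrative).
        self.desc_head = nn.Linear(hidden, head_dim)
        self.req_head = nn.Linear(hidden, head_dim)
        self.ben_head = nn.Linear(hidden, head_dim)
        self.classifier = nn.Linear(3 * head_dim, 2)  # real vs. fake

    def encode(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state[:, 0]  # [CLS] token embedding

    def forward(self, desc, req, ben):
        # Each argument is a dict with "input_ids" and "attention_mask".
        d = torch.relu(self.desc_head(self.encode(**desc)))
        r = torch.relu(self.req_head(self.encode(**req)))
        b = torch.relu(self.ben_head(self.encode(**ben)))
        return self.classifier(torch.cat([d, r, b], dim=-1))
```

Concatenating the per-head outputs before a single classifier lets each text field contribute its own learned representation while the final decision still sees all three fields at once.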
### Components:
Text Encoder: distilbert-base-uncased encodes each textual input into a dense vector.
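For concreteness, this is how the encoder turns one text field into a dense vector with the Hugging Face `transformers` library. Using the [CLS] position as the pooled vector is an assumption, since the card does not say how the hidden states are pooled.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "distilbert/distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

text = "Work from home, no experience needed, earn $5000 a week!"
batch = tokenizer(text, truncation=True, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state  # (1, seq_len, 768)

vector = hidden[:, 0]  # [CLS] embedding used as the dense text vector
print(vector.shape)    # torch.Size([1, 768])
```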
## Future work:
Train on larger datasets and with more compute resources.