|
--- |
|
pipeline_tag: feature-extraction |
|
tags: |
|
- sentence-transformers |
|
- feature-extraction |
|
- sentence-similarity |
|
- transformers |
|
--- |
|
|
|
# Embedding model for Labor Space |
|
|
|
This repository contains a fine-tuned BERT model for [Labor Space: A Unifying Representation of the Labor Market via Large Language Models](https://arxiv.org/abs/2311.06310).
|
|
|
## Model description |
|
LABERT (Labor market + BERT) is a BERT-based sentence-transformers model fine-tuned on a domain-specific corpus of labor-market text.
|
We fine-tune the original BERT model with two objectives to capture the latent structure of the labor market:
|
|
|
- Context learning: We use HuggingFace's "[fill mask](https://huggingface.co/tasks/fill-mask)" pipeline on the description of each entity so that the model captures labor-market context at the individual word-token level. We concatenate (1) 308 NAICS 4-digit descriptions, (2) O*NET's descriptions for 36 skills, 25 knowledge domains, 46 abilities, and 1,016 occupations, (3) ESCO's descriptions for 15,000 skills and 3,000 occupations, and (4) 489 Crunchbase S&P 500 firm descriptions, excluding their labels. A minimal fine-tuning sketch for this stage is shown after this list.
|
|
|
- Relation learning: We add a further fine-tuning stage to incorporate inter-entity relatedness. Different types of labor-market entities are intertwined with one another; for example, industry-specific occupational employment quantifies the relatedness between industries and occupations and tells us which occupations are conceptually close to a given industry. Relation learning makes our embedding space capture this inter-entity relatedness, so an entity's embedding ends up closer to its highly associated entities than to unassociated ones. For more detail, see Section 3.4 (Fine-tuning for relation learning) in the [paper](https://arxiv.org/abs/2311.06310).
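
The context-learning stage follows the standard masked-language-modeling recipe, so it can be sketched with off-the-shelf `transformers` tooling. The snippet below is a minimal illustration of that first objective only; the base checkpoint (`bert-base-uncased`), the inline list of descriptions, the output path, and the hyperparameters are illustrative assumptions, and the relation-learning stage from Section 3.4 of the paper is not shown.

```python
# Minimal sketch of the context-learning (fill-mask) stage.
# Assumptions: entity descriptions are gathered into a plain Python list;
# checkpoint, paths, and hyperparameters are placeholders, not the paper's exact setup.
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

descriptions = [
    "Perform dances. May perform on stage, for broadcasting, or for video recording.",
    # ... NAICS, O*NET, ESCO, and Crunchbase descriptions go here ...
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Tokenize the raw descriptions; keep only the model inputs.
dataset = Dataset.from_dict({"text": descriptions}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Randomly mask 15% of tokens; the model learns to fill them back in.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="labert-context", num_train_epochs=3),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
trainer.save_model("labert-context")
```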
|
|
|
## How to use |
|
|
|
Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed: |
|
|
|
``` |
|
pip install -U sentence-transformers |
|
``` |
|
|
|
Then you can use the model like this: |
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer, models |
|
|
|
base_model = "seongwoon/LAbert" |
|
embedding_model = models.Transformer(base_model) ## Step 1: use an existing language model |
|
|
|
pooling_model = models.Pooling(embedding_model.get_word_embedding_dimension()) ## Step 2: use a pool function over the token embeddings |
|
pooling_model.pooling_mode_mean_tokens = True |
|
pooling_model.pooling_mode_cls_token = False |
|
pooling_model.pooling_mode_max_tokens = False |
|
|
|
model = SentenceTransformer(modules=[embedding_model, pooling_model]) ## Join steps 1 and 2 using the modules argument |
|
|
|
|
|
dancer_description = "Perform dances. May perform on stage, for broadcasting, or for video recording."

embedding_of_dancer_description = model.encode(dancer_description, convert_to_tensor=True)

print(embedding_of_dancer_description)
|
``` |
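
Since the model is intended for sentence similarity over entity descriptions, a typical next step is to compare two embeddings with cosine similarity. The snippet below reuses `model` and `dancer_description` from the block above; the choreographer description is only an illustrative example, not taken from the training corpus.

```python
from sentence_transformers import util

choreographer_description = "Create new dance routines. Rehearse performance of routines."

emb_dancer = model.encode(dancer_description, convert_to_tensor=True)
emb_choreographer = model.encode(choreographer_description, convert_to_tensor=True)

# Cosine similarity between the two entity embeddings (higher = more related).
print(util.cos_sim(emb_dancer, emb_choreographer))
```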
|
|
|
## Full Model Architecture |
|
``` |
|
SentenceTransformer( |
|
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel |
|
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False}) |
|
) |
|
``` |
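
If you do not want to rebuild the Transformer and Pooling modules every time, the assembled model can be saved locally and reloaded in one line. This is a minimal sketch; the output directory name is an arbitrary choice.

```python
# Save the assembled model (Transformer + Pooling) for later reuse.
model.save("labert-sentence-transformer")

# Later, load it directly without reconstructing the modules.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("labert-sentence-transformer")
```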
|
|
|
## Citing & Authors |
|
```bibtex |
|
@inproceedings{kim2024labor, |
|
title={Labor Space: A Unifying Representation of the Labor Market via Large Language Models}, |
|
author={Kim, Seongwoon and Ahn, Yong-Yeol and Park, Jaehyuk}, |
|
booktitle={Proceedings of the ACM on Web Conference 2024}, |
|
pages={2441--2451}, |
|
year={2024} |
|
} |
|
``` |