---
license: apache-2.0
---

# BERT NER Organization Name Extraction

## Overview

This model is a fine-tune of [`bert-base-uncased`](https://huggingface.co/bert-base-uncased), trained to identify and classify named entities in affiliation strings, with a focus on organizations and locations.

## Training Data

The training data comprised approximately 500,000 programmatically annotated affiliation strings. Named entities in each string were tagged with their type (organization or location), and all other text was marked as extraneous (`O`). Example annotation format, with one `tag: token` pair per line:

```
O: Internal
O: Medicine
O: Complex
B-ORG: College
I-ORG: of
I-ORG: Medical
I-ORG: Sciences
B-LOC: New
I-LOC: Delhi
B-LOC: India
```
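
To make the tagging scheme concrete, the following minimal sketch parses the annotation format above and merges consecutive `B-`/`I-` tags into entity spans. The helper names `parse_annotation` and `group_entities` are hypothetical and are not part of the model's codebase; the sketch only assumes the `tag: token` layout shown in the example.

```python
from typing import List, Tuple

# The example annotation from this model card, one "tag: token" pair per line.
EXAMPLE = """\
O: Internal
O: Medicine
O: Complex
B-ORG: College
I-ORG: of
I-ORG: Medical
I-ORG: Sciences
B-LOC: New
I-LOC: Delhi
B-LOC: India
"""


def parse_annotation(text: str) -> List[Tuple[str, str]]:
    """Split each non-empty "tag: token" line into a (tag, token) pair."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        tag, token = line.split(": ", 1)
        pairs.append((tag, token))
    return pairs


def group_entities(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (entity_type, text) spans."""
    spans: List[Tuple[str, List[str]]] = []
    current_type = None
    for tag, token in pairs:
        if tag.startswith("B-"):
            # A B- tag always opens a new entity.
            current_type = tag[2:]
            spans.append((current_type, [token]))
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            spans[-1][1].append(token)
        else:
            # "O" tokens and stray I- tags close any open entity and are dropped.
            current_type = None
    return [(etype, " ".join(tokens)) for etype, tokens in spans]


entities = group_entities(parse_annotation(EXAMPLE))
print(entities)
```

Here `India` becomes its own `LOC` span rather than part of `New Delhi`, because its `B-LOC` tag opens a new entity.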

The training data was derived from [OpenAlex](https://openalex.org/) affiliation strings and their [ROR ID](https://ror.org) assignments. Each string was tagged using the name and location metadata from its assigned ROR record. Location names were further supplemented with aliases derived from the [Unicode Common Locale Data Repository (CLDR)](https://cldr.unicode.org/).
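
The exact annotation pipeline is not described here, so the following is only a rough sketch of how such programmatic tagging could work, assuming simple case-insensitive matching of whole tokens against known names. The function `tag_affiliation` and its `names_by_type` argument are hypothetical; in practice the names would come from the ROR record and CLDR aliases, and the matching rules may well differ.

```python
from typing import Dict, List, Tuple


def tag_affiliation(
    affiliation: str, names_by_type: Dict[str, List[str]]
) -> List[Tuple[str, str]]:
    """Assign BIO tags to whitespace tokens by matching known entity names.

    Assumption: matching is exact on lowercased tokens, ignoring trailing
    or leading ",", "." and ";" on the affiliation tokens.
    """
    tokens = affiliation.split()
    tags = ["O"] * len(tokens)
    normalized = [t.lower().strip(",.;") for t in tokens]

    for etype, names in names_by_type.items():
        for name in names:
            phrase = name.lower().split()
            n = len(phrase)
            if n == 0:
                continue
            for start in range(len(tokens) - n + 1):
                window = normalized[start:start + n]
                untagged = all(t == "O" for t in tags[start:start + n])
                # Only tag a match that does not overlap an earlier match.
                if window == phrase and untagged:
                    tags[start] = f"B-{etype}"
                    for i in range(start + 1, start + n):
                        tags[i] = f"I-{etype}"
    return list(zip(tags, tokens))


tagged = tag_affiliation(
    "Internal Medicine Complex College of Medical Sciences New Delhi India",
    {"ORG": ["College of Medical Sciences"], "LOC": ["New Delhi", "India"]},
)
for tag, token in tagged:
    print(f"{tag}: {token}")
```

With these inputs the sketch reproduces the example annotation shown earlier in this section. Because earlier matches take precedence, the order of `names_by_type` matters when an organization name contains a location name.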

## Training Details

- **Dataset Size**: ~500,000 items
- **Number of Epochs**: 3
- **Optimizer**: AdamW
- **Training Environment**: Google Colab T4 High-RAM instance
- **Training Duration**: Approximately 8 hours

## Usage

See https://github.com/ror-community/affiliation-matching-experimental/tree/main/ner_tests/inference for example usage.