Model Card for Astro-HEP-BERT

Astro-HEP-BERT is a bidirectional transformer designed primarily to generate contextualized word embeddings for computational conceptual analysis in astrophysics and high-energy physics (HEP). Built upon Google's bert-base-uncased, the model underwent additional training for three epochs on the Astro-HEP Corpus, which contains 21.84 million paragraphs drawn from more than 600,000 scholarly articles sourced from arXiv, all pertaining to astrophysics and/or HEP. The sole training objective was masked language modeling.
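
The minimal sketch below shows one way to obtain contextualized token embeddings from the model with the Hugging Face transformers library. The example sentence and the choice of reading embeddings from the final hidden layer are illustrative assumptions, not prescriptions from the underlying research project.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and encoder weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("arnosimons/astro-hep-bert")
model = AutoModel.from_pretrained("arnosimons/astro-hep-bert")
model.eval()

# Illustrative input sentence (hypothetical example)
sentence = "Dark matter halos trace the large-scale structure of the universe."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextualized embedding per (sub)word token from the last hidden layer
token_embeddings = outputs.last_hidden_state.squeeze(0)  # shape: (num_tokens, 768)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

for token, embedding in zip(tokens, token_embeddings):
    print(token, tuple(embedding.shape))
```

Other pooling strategies (e.g. averaging selected hidden layers or subword tokens of a target term) may be more appropriate depending on the analysis; see the papers listed below for the approach taken in the research project.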

The Astro-HEP-BERT project demonstrates the general feasibility of training a customized bidirectional transformer for computational conceptual analysis in the history, philosophy, and sociology of science as an open-source endeavor that does not require a substantial budget. Leveraging only freely available code, weights, and text inputs, the entire training process was conducted on a single MacBook Pro (M2, 96 GB).

For further insights into the model, the corpus, and the underlying research project (Network Epistemology in Practice), please refer to the following two papers:

  1. Simons, A. (2024). Astro-HEP-BERT: A bidirectional language model for studying the meanings of concepts in astrophysics and high energy physics. arXiv:2411.14877.

  2. Simons, A. (2024). Meaning at the Planck scale? Contextualized word embeddings for doing history, philosophy, and sociology of science. arXiv:2411.14073.

Model Details

- Model size: 110M parameters
- Tensor type: F32
- Weights format: Safetensors