---
language: en
license: cc-by-4.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- bert
- accelerator-physics
- physics
- scientific-literature
- embeddings
- domain-specific
library_name: sentence-transformers
pipeline_tag: feature-extraction
base_model: thellert/physbert_cased
model-index:
- name: AccPhysBERT
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      name: Accelerator Physics Publications
      type: accelerator-physics
    metrics:
    - type: cosine_accuracy
      value: 0.91
      name: Citation Classification
    - type: v_measure
      value: 0.637
      name: Category Clustering (main)
    - type: ndcg_at_10
      value: 0.663
      name: Information Retrieval
datasets:
- inspire-hep
---

# AccPhysBERT

**AccPhysBERT** is a specialized sentence-embedding model fine-tuned for **accelerator physics**, capturing the semantic nuances of this technical domain. It delivers state-of-the-art performance on semantic search, citation classification, reviewer matching, and clustering of accelerator-physics literature.

---

## Model Description

- **Architecture**: BERT-based, fine-tuned from [PhysBERT (cased)](https://huggingface.co/thellert/physbert_cased) using supervised contrastive learning (SimCSE).
- **Optimized For**: Titles, abstracts, proposals, and full text from the accelerator-physics community.
- **Notable Features**:
  - Trained on 109k accelerator-physics publications from INSPIRE HEP
  - Leverages 690k citation pairs and 2M synthetic query–source pairs
  - Trained via SentenceTransformers to produce dense, semantically rich embeddings

**Developed by**: Thorsten Hellert, João Montenegro, Marco Venturini, Andrea Pollastro  
**Funded by**: US Department of Energy, Lawrence Berkeley National Laboratory  
**Model Type**: Sentence embedding (BERT-based, SimCSE fine-tuned)  
**Language**: English  
**License**: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)  
**Paper**: *Domain-specific text embedding model for accelerator physics*, Phys. Rev. Accel. Beams 28, 044601 (2025), [https://doi.org/10.1103/PhysRevAccelBeams.28.044601](https://doi.org/10.1103/PhysRevAccelBeams.28.044601)

---

## Training Data

- **Core Corpus**:
  - 109,000 accelerator-physics publications (INSPIRE HEP category "Accelerators")
  - Over 1 GB of full-text, markdown-style text (extracted via Nougat OCR)
- **Annotation Sources**:
  - 690,000 citation pairs
  - 49 semantic categories labeled via ChatGPT-4o
  - 2,000,000 synthetic query–source pairs generated with LLaMA3-70B

---

## Training Procedure

- **Fine-tuning Method**: SimCSE (contrastive loss)
- **Hyperparameters**:
  - Batch size: 512
  - Learning rate: 2e-4
  - Temperature: 0.05
  - Weight decay: 0.01
  - Optimizer: Adam
  - Epochs: 2
- **Infrastructure**: 32 × NVIDIA A100 GPUs at NERSC
- **Framework**: SentenceTransformers

---

## Evaluation Results

| Task                    | Metric                 | Score       |
|-------------------------|------------------------|-------------|
| Citation Classification | Cosine accuracy        | 91.0        |
| Category Clustering     | V-measure (main / sub) | 63.7 / 77.2 |
| Information Retrieval   | nDCG@10                | 66.3        |

All scores are on a 0–100 scale. AccPhysBERT outperforms BERT, SciBERT, and large general-purpose embedding models on all accelerator-specific benchmarks.

---

## Example Usage

Load the model with `transformers` and mean-pool the token embeddings:

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("thellert/accphysbert")
model = AutoModel.from_pretrained("thellert/accphysbert")

text = "We report on beam instabilities observed in the LCLS-II injector."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling over token embeddings, excluding [CLS] and [SEP].
# The slice below is valid for a single, unpadded sequence; for batched
# or padded inputs, mask out padding tokens before averaging.
token_embeddings = outputs.last_hidden_state[:, 1:-1, :]
sentence_embedding = token_embeddings.mean(dim=1)
```
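
Since the card lists `sentence-transformers` as the library, the model should also be usable through that API. A minimal sketch, assuming the checkpoint loads directly with `SentenceTransformer` (if the repository ships no sentence-transformers configuration, the library falls back to a plain Transformer module with mean pooling over all tokens, which differs slightly from the pooling above):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thellert/accphysbert")

# Illustrative sentences, not drawn from the training corpus
sentences = [
    "We report on beam instabilities observed in the LCLS-II injector.",
    "Emittance growth in the booster is traced to quadrupole misalignments.",
]

# encode() returns one embedding per sentence, shape (2, hidden_size)
embeddings = model.encode(sentences)

# Cosine similarity between the two embeddings
print(util.cos_sim(embeddings[0], embeddings[1]))
```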

---

## Citation

If you use AccPhysBERT, please cite:

```bibtex
@article{Hellert_2025,
  title     = {Domain-specific text embedding model for accelerator physics},
  author    = {Hellert, Thorsten and Montenegro, João and Venturini, Marco and Pollastro, Andrea},
  journal   = {Physical Review Accelerators and Beams},
  volume    = {28},
  number    = {4},
  pages     = {044601},
  year      = {2025},
  publisher = {American Physical Society},
  doi       = {10.1103/PhysRevAccelBeams.28.044601},
  url       = {https://doi.org/10.1103/PhysRevAccelBeams.28.044601}
}
```

---

## Contact

Thorsten Hellert  
Lawrence Berkeley National Laboratory  
📧 thellert@lbl.gov

---

## Acknowledgments

This model builds on PhysBERT and was trained using NERSC resources. Thanks to Alex Hexemer, Fernando Sannibale, and Antonin Sulc for their support and discussions.