---
language: en
license: cc-by-4.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- bert
- accelerator-physics
- physics
- scientific-literature
- embeddings
- domain-specific
library_name: sentence-transformers
pipeline_tag: feature-extraction
base_model: thellert/physbert_cased
model-index:
- name: AccPhysBERT
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      name: Accelerator Physics Publications
      type: accelerator-physics
    metrics:
    - type: cosine_accuracy
      value: 0.91
      name: Citation Classification
    - type: v_measure
      value: 0.637
      name: Category Clustering (main)
    - type: ndcg_at_10
      value: 0.663
      name: Information Retrieval
datasets:
- inspire-hep
---

# AccPhysBERT

**AccPhysBERT** is a specialized sentence-embedding model fine-tuned for **accelerator physics**, capturing the semantic nuances of this technical domain. It delivers state-of-the-art performance on semantic search, citation classification, reviewer matching, and clustering of accelerator-physics literature.

---

## Model Description

- **Architecture**: BERT-based, fine-tuned from [PhysBERT (cased)](https://huggingface.co/thellert/physbert_cased) using supervised contrastive learning (SimCSE).
- **Optimized For**: Titles, abstracts, proposals, and full text from the accelerator-physics community.
- **Notable Features**:
  - Trained on 109k accelerator-physics publications from INSPIRE HEP
  - Leverages 690k citation pairs and 2M synthetic query–source pairs
  - Trained via SentenceTransformers to produce dense, semantically rich embeddings

**Developed by**: Thorsten Hellert, João Montenegro, Marco Venturini, Andrea Pollastro  
**Funded by**: US Department of Energy, Lawrence Berkeley National Laboratory  
**Model Type**: Sentence embedding (BERT-based, SimCSE fine-tuned)  
**Language**: English  
**License**: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)  
**Paper**: *Domain-specific text embedding model for accelerator physics*, Phys. Rev. Accel. Beams 28, 044601 (2025), [https://doi.org/10.1103/PhysRevAccelBeams.28.044601](https://doi.org/10.1103/PhysRevAccelBeams.28.044601)

---

## Training Data

- **Core Corpus**:
  - 109,000 accelerator-physics publications (INSPIRE HEP category "Accelerators")
  - Over 1 GB of full-text, markdown-style text (extracted via Nougat OCR)
- **Annotation Sources**:
  - 690,000 citation pairs
  - 49 semantic categories labeled via ChatGPT-4o
  - 2,000,000 synthetic query–source pairs generated with LLaMA3-70B

---

## Training Procedure

- **Fine-tuning Method**: SimCSE (contrastive loss)
- **Hyperparameters**:
  - Batch size: 512
  - Learning rate: 2e-4
  - Temperature: 0.05
  - Weight decay: 0.01
  - Optimizer: Adam
  - Epochs: 2
- **Infrastructure**: 32 × NVIDIA A100 GPUs at NERSC
- **Framework**: SentenceTransformers

---

## Evaluation Results

| Task                    | Metric                 | Score       |
|-------------------------|------------------------|-------------|
| Citation Classification | Cosine accuracy        | 91.0        |
| Category Clustering     | V-measure (main / sub) | 63.7 / 77.2 |
| Information Retrieval   | nDCG@10                | 66.3        |

All scores are on a 0–100 scale. AccPhysBERT outperforms BERT, SciBERT, and large general-purpose embedding models on all accelerator-specific benchmarks.

---

## Example Usage

Load the model with `transformers` and mean-pool the token embeddings:

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("thellert/accphysbert")
model = AutoModel.from_pretrained("thellert/accphysbert")

text = "We report on beam instabilities observed in the LCLS-II injector."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling over token embeddings, excluding [CLS] and [SEP].
# The slice below is valid for a single, unpadded sequence; for batched
# or padded inputs, mask out padding tokens before averaging.
token_embeddings = outputs.last_hidden_state[:, 1:-1, :]
sentence_embedding = token_embeddings.mean(dim=1)
```
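
Since the card lists `sentence-transformers` as the library, the model should also be usable through that API. A minimal sketch, assuming the checkpoint loads directly with `SentenceTransformer` (if the repository ships no sentence-transformers configuration, the library falls back to a plain Transformer module with mean pooling over all tokens, which differs slightly from the pooling above):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thellert/accphysbert")

# Illustrative sentences, not drawn from the training corpus
sentences = [
    "We report on beam instabilities observed in the LCLS-II injector.",
    "Emittance growth in the booster is traced to quadrupole misalignments.",
]

# encode() returns one embedding per sentence, shape (2, hidden_size)
embeddings = model.encode(sentences)

# Cosine similarity between the two embeddings
print(util.cos_sim(embeddings[0], embeddings[1]))
```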

---

## Citation

If you use AccPhysBERT, please cite:

```bibtex
@article{Hellert_2025,
  title     = {Domain-specific text embedding model for accelerator physics},
  author    = {Hellert, Thorsten and Montenegro, João and Venturini, Marco and Pollastro, Andrea},
  journal   = {Physical Review Accelerators and Beams},
  volume    = {28},
  number    = {4},
  pages     = {044601},
  year      = {2025},
  publisher = {American Physical Society},
  doi       = {10.1103/PhysRevAccelBeams.28.044601},
  url       = {https://doi.org/10.1103/PhysRevAccelBeams.28.044601}
}
```

---

## Contact

Thorsten Hellert  
Lawrence Berkeley National Laboratory  
📧 thellert@lbl.gov

---

## Acknowledgments

This model builds on PhysBERT and was trained using NERSC resources. Thanks to Alex Hexemer, Fernando Sannibale, and Antonin Sulc for their support and discussions.