tingyuansen committed dc9d99b (verified) · 1 Parent(s): b64b3ce

Update README.md

Files changed (1)
  1. README.md +84 -38
README.md CHANGED
@@ -1,58 +1,104 @@
  ---
- base_model: meta-llama/Meta-Llama-3-8B
  tags:
- - generated_from_trainer
- datasets:
- - customized
- model-index:
- - name: astro-7b_llama3_lr-2e-5
-   results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # astro-7b_llama3_lr-2e-5

- This model is a fine-tuned version of [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) on the customized dataset.

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 2e-05
- - train_batch_size: 24
- - eval_batch_size: 1
- - seed: 42
- - distributed_type: multi-GPU
- - num_devices: 8
- - total_train_batch_size: 192
- - total_eval_batch_size: 8
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_ratio: 0.03
- - num_epochs: 1.0

- ### Training results

- ### Framework versions

- - Transformers 4.40.0
- - Pytorch 2.2.2+cu121
- - Datasets 2.14.6
- - Tokenizers 0.19.1

  ---
+ license: mit
+ language:
+ - en
+ pipeline_tag: text-generation
  tags:
+ - llama-3
+ - astronomy
+ - astrophysics
+ - arxiv
+ inference: false
+ base_model:
+ - meta-llama/Meta-Llama-3-8B
  ---

+ # AstroLLaMA-3-8B-Base_AIC

+ AstroLLaMA-3-8B is a specialized base language model for astronomy, built by fine-tuning Meta's LLaMA-3-8B architecture on astronomical literature. It was developed by the AstroMLab team and is intended for next-token prediction; it is not an instruct/chat model.

+ ## Model Details

+ - **Base Architecture**: LLaMA-3-8B
+ - **Training Data**: Abstract, Introduction, and Conclusion (AIC) sections from arXiv's astro-ph category papers (from arXiv's inception up to January 2024)
+ - **Training Details** (see the illustrative configuration sketch after this list):
+   - Learning rate: 2 × 10⁻⁵
+   - Total batch size: 96
+   - Maximum token length: 512
+   - Warmup ratio: 0.03
+   - No gradient accumulation
+   - BF16 format
+   - Cosine decay schedule for learning rate reduction
+   - Training duration: 1 epoch (approximately 32 A100 GPU hours)
+ - **Data Processing**: Optical character recognition (OCR) on PDF files using the Nougat tool, followed by summarization using Qwen-2-8B and LLaMA-3.1-8B
+ - **Fine-tuning Method**: Continual Pre-Training (CPT) using the LMFlow framework
+ - **Primary Use**: Next token prediction for astronomy-related text generation and analysis
+ - **Reference**: Pan et al. 2024 [Link to be added]
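
+ The fine-tuning itself was done with the LMFlow framework, and the exact configuration is not reproduced in this card. Purely as an illustrative sketch, the hyperparameters listed above could be expressed with the Hugging Face `Trainer` roughly as follows; the 8-GPU split (8 × 12 = 96), the placeholder data file `astro_ph_aic.txt`, and the output directory are assumptions for illustration, not the authors' actual setup:

+ ```python
+ # Illustrative sketch only: continual pre-training with the hyperparameters listed above.
+ # Assumes 8 GPUs (8 x 12 = 96 total batch size); the data file and output path are placeholders.
+ from datasets import load_dataset
+ from transformers import (AutoModelForCausalLM, AutoTokenizer,
+                           DataCollatorForLanguageModeling, Trainer, TrainingArguments)

+ model_name = "meta-llama/Meta-Llama-3-8B"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ tokenizer.pad_token = tokenizer.eos_token
+ model = AutoModelForCausalLM.from_pretrained(model_name)

+ # Placeholder corpus of AIC text, one document per line; tokenize to at most 512 tokens.
+ corpus = load_dataset("text", data_files={"train": "astro_ph_aic.txt"})["train"]
+ tokenized = corpus.map(
+     lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
+     batched=True,
+     remove_columns=["text"],
+ )

+ args = TrainingArguments(
+     output_dir="astrollama-3-8b-cpt",
+     learning_rate=2e-5,               # 2 x 10^-5
+     per_device_train_batch_size=12,   # 12 per GPU x 8 GPUs = 96 total (assumed split)
+     gradient_accumulation_steps=1,    # no gradient accumulation
+     num_train_epochs=1,
+     warmup_ratio=0.03,
+     lr_scheduler_type="cosine",
+     bf16=True,
+ )

+ trainer = Trainer(
+     model=model,
+     args=args,
+     train_dataset=tokenized,
+     data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
+ )
+ trainer.train()
+ ```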
 
+ ## Generating text from a prompt

+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
+ import torch

+ # Load the model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("AstroMLab/astrollama-3-8b-base_aic")
+ model = AutoModelForCausalLM.from_pretrained("AstroMLab/astrollama-3-8b-base_aic", device_map="auto")

+ # Create the pipeline with explicit truncation
+ generator = pipeline(
+     "text-generation",
+     model=model,
+     tokenizer=tokenizer,
+     device_map="auto",
+     truncation=True,
+     max_length=512
+ )

+ # Example prompt from an astronomy paper
+ prompt = (
+     "In this letter, we report the discovery of the highest redshift, "
+     "heavily obscured, radio-loud QSO candidate selected using JWST NIRCam/MIRI, "
+     "mid-IR, sub-mm, and radio imaging in the COSMOS-Web field. "
+ )

+ # Set seed for reproducibility
+ torch.manual_seed(42)

+ # Generate text
+ generated_text = generator(prompt, do_sample=True)
+ print(generated_text[0]['generated_text'])
+ ```
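
+ Because this is a base model, a common "analysis" use is simply scoring how well a passage agrees with the model's next-token predictions. The following minimal sketch reuses the `model` and `tokenizer` loaded above to compute the perplexity of a passage; the example sentence is an arbitrary placeholder, not drawn from the training data:

+ ```python
+ import torch

+ # Arbitrary placeholder passage to score.
+ text = "The intracluster medium is a hot, diffuse plasma that pervades galaxy clusters."
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)

+ with torch.no_grad():
+     # With labels equal to input_ids, the model returns the mean next-token cross-entropy loss.
+     loss = model(**inputs, labels=inputs["input_ids"]).loss

+ # Lower perplexity means the passage is more in line with what the model has learned.
+ print(f"Perplexity: {torch.exp(loss).item():.2f}")
+ ```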
 
+ ## Model Limitations and Biases

+ A key limitation identified during the development of this model is that training solely on astro-ph data may not be sufficient to significantly improve performance over the base model, especially for the already highly performant LLaMA-3 series. This suggests that to achieve substantial gains, future iterations may need to incorporate a broader range of high-quality astronomical data beyond arXiv, such as textbooks, Wikipedia, and curated summaries.

+ Here is a performance comparison based on the astronomical benchmarking Q&A described in [Ting et al. 2024](https://arxiv.org/abs/2407.11194) and Pan et al. 2024:

+ | Model | Score (%) |
+ |-------|-----------|
+ | **AstroLLaMA-3-8B (AstroMLab)** | **72.3** |
+ | LLaMA-3-8B | 72.0 |
+ | Gemma-2-9B | 71.5 |
+ | Qwen-2.5-7B | 70.4 |
+ | Yi-1.5-9B | 68.4 |
+ | InternLM-2.5-7B | 64.0 |
+ | Mistral-7B-v0.3 | 63.9 |
+ | ChatGLM3-6B | 50.4 |

+ As shown, while AstroLLaMA-3-8B performs competitively among models in its class, it does not surpass the performance of the base LLaMA-3-8B model. This underscores the challenges in developing specialized models and the need for more diverse and comprehensive training data.

+ The forthcoming AstroLLaMA-3-8B-Plus, to be released next, addresses these limitations by expanding the training data beyond astro-ph.

+ ## Ethical Considerations

+ While this model is designed for scientific use, users should be mindful of potential misuse, such as generating misleading scientific content. Always verify model outputs against peer-reviewed sources for critical applications.

+ ## Citation

+ If you use this model in your research, please cite:

+ ```
+ [Citation for Pan et al. 2024 to be added]
+ ```