---
license: llama2
language:
- en
base_model:
- m3rg-iitd/llamat-2
tags:
- material science
- large language model
- domain adaptation
- scientific domain adaptation
- materials copilot
- information extraction
- table understanding
- table data parsing
---

# Model Card for LLaMat-2-Chat

## Overview
LLaMat-2-Chat is a specialized large language model designed to serve as an AI copilot for materials research. Finetuned from LLaMat-2, this model is adapted for tasks such as information extraction from material science text and tabular data. It provides advanced capabilities in scientific data processing, assisting researchers in analyzing and interpreting material science literature, reports, and datasets.
For more details, refer to our paper: [Foundational Large Language Models for Materials Research](https://arxiv.org/abs/2412.09560).

## Model Details
- Model Type: Large Language Model (LLM)
- Base Model: LLaMat-2 (continued pretraining of LLaMA-2 on material science data)
- Language: English
- License: LLaMA-2 License
- Tags: Material Science, Domain Adaptation, Table Understanding, Scientific Data Parsing, Materials Copilot
- Developed by: M3RG, IIT Delhi & DAIR, IIT Delhi

## Key Features
- Instruction Following Abilities: Optimized for understanding and processing instructions in the material science domain.
- Domain-Specific Expertise: Continually pretrained on material science text, enabling high performance in scientific applications.
- Applications: Information extraction, table understanding, and data parsing for research tasks.

## Intended Use
LLaMat-2-Chat is designed to assist researchers, scientists, and industry professionals in:
- Extracting structured information from material science texts and tables.
- Analyzing experimental results and processing large datasets.
- Assisting in literature review and knowledge discovery.
- Supporting research-driven natural language queries related to material science.
This model is intended for academic and industrial research purposes.
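
The example below is a minimal usage sketch with Hugging Face Transformers for one of these tasks, extracting structured information from a sentence. The repository id `m3rg-iitd/llamat-2-chat`, the prompt wording, and the example sentence are illustrative assumptions and should be adapted to the released checkpoint.

```python
# Minimal inference sketch (assumption: the repo id below is illustrative,
# not confirmed by this card; replace it with the actual checkpoint name).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "m3rg-iitd/llamat-2-chat"  # hypothetical repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # fits on a single A100 80GB, as noted under Hardware
    device_map="auto",
)

# Ask the model to extract composition and property from a sentence as JSON.
prompt = (
    "Extract the composition and the reported property from the following sentence "
    "as JSON with keys 'composition' and 'property'.\n\n"
    "Sentence: The Zr65Cu17.5Ni10Al7.5 bulk metallic glass exhibits a glass "
    "transition temperature of 656 K.\n\nJSON:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

The same pattern applies to the other uses listed above; only the prompt and the generation parameters change.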

## Technical Specifications

### Hardware Infrastructure
- Pretraining: 2 Cerebras CS-2 Wafer-Scale Engines (WSE-2)
- Finetuning: 8 NVIDIA A100 80GB GPUs
- Inference: 1 NVIDIA A100 80GB GPU

### Software Stack
- Frameworks: PyTorch, Hugging Face Transformers, Meditron-LLM Library
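
As an illustration of how this stack supports table understanding, the sketch below runs a small markdown table through the `transformers` text-generation pipeline and asks for JSON records. The repository id and the prompt are again assumptions rather than a documented interface.

```python
# Table-parsing sketch with the transformers text-generation pipeline.
# Assumption: "m3rg-iitd/llamat-2-chat" is an illustrative repo id.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="m3rg-iitd/llamat-2-chat",  # hypothetical repository id
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# A small example table; real inputs would come from parsed papers or reports.
table = (
    "| Sample | Composition | Yield strength (MPa) |\n"
    "|--------|-------------|----------------------|\n"
    "| S1     | Ti-6Al-4V   | 880                  |\n"
    "| S2     | AZ31 Mg     | 200                  |\n"
)

prompt = (
    "Convert the following table into a list of JSON records with keys "
    "'sample', 'composition', and 'yield_strength_mpa'.\n\n" + table + "\nJSON:"
)

result = generator(prompt, max_new_tokens=200, do_sample=False)
# The pipeline returns the prompt plus the completion; keep only the completion.
print(result[0]["generated_text"][len(prompt):])
```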

## Training Data
LLaMat-2-Chat was trained on a curated corpus of material science literature, scientific papers, structured datasets, and technical reports. The training set includes:
- Material science research papers published in Elsevier and Springer journals
- Material science community discourse
- RedPajama dataset
- OpenOrca instruction finetuning dataset
- MathQA dataset
- MatSciNLP benchmark dataset
- Task-specific datasets (listed in Table A.2 of Foundational Large Language Models for Materials Research)

## Results

Detailed results and comparisons with existing models are reported in [Foundational Large Language Models for Materials Research](https://arxiv.org/abs/2412.09560).

## Development and Support
- Developed by: M3RG, IIT Delhi & DAIR, IIT Delhi
- Compute Support:
  - IIT Delhi High-Performance Computing Cluster: Supported the fine-tuning and inference stages.
  - Edinburgh International Data Facility (EIDF): Provided access to Cerebras CS-2 clusters for pretraining.

## Repository with training and evaluation code
- Repository: LLaMat-2 on GitHub

## Citation
If you use LLaMat-2-Chat in your research, please cite our work:
@article{LLaMat-2,
  author  = {Vaibhav Mishra and Somaditya Singh and Dhruv Ahlawat and Mohd Zaki and Vaibhav Bihani and Hargun Singh Grover and Biswajit Mishra and Santiago Miret and Mausam and N. M. Anoop Krishnan},
  title   = {Foundational Large Language Models for Materials Research},
  journal = {arXiv preprint arXiv:2412.09560},
  year    = {2024},
  url     = {https://arxiv.org/abs/2412.09560}
}