---
library_name: transformers
tags:
- chemistry
- biology
- SELFIES
- life-sciences
license: mit
datasets:
- mikemayuare/PubChem10M_SMILES_SELFIES
---

# Model Card for SELFYBPE

A RoBERTa-based model pretrained with masked language modeling (MLM), ready to fine-tune on specific tasks.

## Model Details

### Model Description

A RoBERTa-based model pretrained with masked language modeling (MLM). It was pretrained on 2 million SELF-referencIng Embedded Strings (SELFIES), tokenized with byte-pair encoding (BPE).

- **Developed by:** Miguelangel Leon Mayuare
- **Funded by:** This work was supported by national funds through FCT (Fundação para a Ciência e a Tecnologia), under the project UIDB/04152/2020 (DOI: 10.54499/UIDB/04152/2020), Centro de Investigação em Gestão de Informação (MagIC)/NOVA IMS. Aleš Popovič was supported by the Slovenian Research and Innovation Agency (ARIS) under research core funding P2-0442.
- **Shared by:** Miguelangel Leon Mayuare
- **Model type:** RoBERTa-based
- **Language(s) (NLP):** SELFIES
- **License:** MIT

### Model Sources

- **Paper:** Under review

## Uses

The model is intended for fine-tuning on downstream tasks where SELFIES is the main input.

### Direct Use

The model can be directly used for the classification of chemical compounds and prediction of molecular properties using SELFIES representations.

### Downstream Use

The model can be fine-tuned for specific tasks such as drug discovery, toxicity prediction, and other cheminformatics applications using task-specific datasets.

### Out-of-Scope Use

The model should not be used for tasks outside of cheminformatics or without proper validation for the specific task. Misuse includes using the model to generate invalid chemical compounds or to make predictions outside the domain of the training data. The model only works with SELFIES input; for SMILES models, see the mikemayuare repository.

## Bias, Risks, and Limitations

The model may inherit biases from the training data. Limitations include potential overfitting to the pre-training tasks and the resources required for training and fine-tuning.
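Because the model accepts only SELFIES input, one practical risk is passing a SMILES string by mistake. The sketch below is a cheap shape check that could run before tokenization. It is an illustration only: the helper name `looks_like_selfies` is hypothetical and not part of this model or of the `selfies` library, and the check covers syntax shape only, not chemical validity.

```python
import re

# One bracketed symbol such as [C], [=O] or [Branch1].
_SYMBOL = r"\[[^\[\]]+\]"
# One or more symbols, with optional '.'-separated fragments.
_SELFIES_SHAPE = re.compile(rf"(?:{_SYMBOL})+(?:\.(?:{_SYMBOL})+)*")


def looks_like_selfies(text: str) -> bool:
    """Return True if `text` consists only of bracketed symbols.

    Heuristic (assumption of this sketch): typical SMILES such as "CCO"
    contain characters outside brackets and are rejected. A bracket-only
    SMILES such as "[Na+].[Cl-]" still passes, so this is not a validity check.
    """
    return _SELFIES_SHAPE.fullmatch(text.strip()) is not None


if __name__ == "__main__":
    for candidate in ["[C][C][O]", "CCO", "c1ccccc1"]:
        print(candidate, looks_like_selfies(candidate))
```

A full validity check would require decoding the string with the `selfies` library rather than matching its shape.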
### Recommendations

2 million SELFIES were used to pretrain the model in order to mitigate misrepresentation (over- and under-representation) of any type of molecule. Validating on known datasets for the downstream task is the best way to assess its limitations.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoModel, AutoTokenizer

# Load the pretrained BPE tokenizer and the RoBERTa-based encoder from the Hub
tokenizer = AutoTokenizer.from_pretrained("mikemayuare/SELFYBPE")
model = AutoModel.from_pretrained("mikemayuare/SELFYBPE")
```

## Training Details

### Training Data

The training data comprised 2 million molecules from the PubChem dataset. SMILES strings were converted to SELFIES using the selfies library.

### Training Procedure

The models were pre-trained for 20 epochs using the AdamW optimizer on an NVIDIA 3060 GPU with 12 GiB of VRAM.

#### Preprocessing

SMILES strings were converted to SELFIES using the selfies library, and tokenizers were trained on a subset of 1 million molecules from the PubChem dataset.

#### Training Hyperparameters

- **Training regime:** fp32
- **Batch size:** 32
- **Number of epochs:** 20
- **Optimizer:** AdamW

#### Speeds, Sizes, Times

Training took approximately 72 hours on the specified hardware. Each checkpoint is approximately 500 MB.

## Evaluation

#### Testing Data

Testing was conducted on the MoleculeNet datasets BBBP, HIV, and Tox21.

#### Factors

Evaluation metrics were disaggregated by dataset and task type (e.g., binary classification for BBBP).

#### Metrics

The primary evaluation metric was the ROC-AUC score of the fine-tuned models, which is commonly used for binary classification tasks in cheminformatics.

### Results

The models tokenized with APE generally outperformed those tokenized with BPE. SMILES models performed better than SELFIES models in most cases.

#### Summary

The model achieved competitive performance on standard benchmarks, outperforming several baseline models on specific tasks.
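For readers unfamiliar with the ROC-AUC metric used above, the following is a minimal pure-Python sketch of how it can be computed from binary labels and predicted scores. It is not the evaluation code used for this model; the function name `roc_auc` and the toy labels and scores are illustrative assumptions, not results. In practice a library routine such as `sklearn.metrics.roc_auc_score` would normally be used.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy data for illustration only (not model outputs).
    toy_labels = [0, 0, 1, 1]
    toy_scores = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(toy_labels, toy_scores))
```

The pairwise loop is quadratic in the number of examples, which is fine for a small illustration; rank-based implementations are preferable for large test sets.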
## Model Examination

Interpretability analyses showed that models tokenized with APE preserved the chemical context better than those tokenized with BPE, leading to higher classification accuracy.

## Environmental Impact

Carbon emissions were estimated using the Machine Learning Impact calculator.

- **Hardware Type:** NVIDIA 3060 GPU
- **Hours used:** 72 hours
- **Cloud Provider:** Not applicable
- **Compute Region:** Local
- **Carbon Emitted:** Approximately 50 kg CO2eq

## Technical Specifications

### Model Architecture and Objective

The model architecture is based on RoBERTa, with 6 hidden layers, a hidden size of 768, an intermediate size of 1536, and 12 attention heads.

### Compute Infrastructure

#### Hardware

- **Type:** NVIDIA 3060 GPU
- **VRAM:** 12 GiB

#### Software

- **Framework:** PyTorch
- **Libraries:** transformers, selfies, DeepChem, Optuna

## Citation

**BibTeX:**

```bibtex
@mastersthesis{leon2024chemical,
  title={Chemical Language Modeling},
  author={Miguelangel Augusto Leon Mayuare},
  year={2024},
  school={NOVA Information Management School}
}
```

**APA:**

Mayuare, M. A. L. (2024). *Chemical Language Modeling* (Master's thesis). NOVA Information Management School.

## Glossary

- **SELFIES:** A string-based representation of molecules.
- **SMILES:** Simplified Molecular Input Line Entry System, a notation for describing the structure of chemical species.

## More Information

For more details, refer to the paper (pending publication).

## Model Card Authors

- Miguelangel Augusto Leon Mayuare

## Model Card Contact

For inquiries, please contact migueleonm@gmail.com