---
library_name: transformers
tags:
- chemistry
- biology
- SELFIES
- life-sciences
license: mit
datasets:
- mikemayuare/PubChem10M_SMILES_SELFIES
---

# Model Card for SELFYBPE

<!-- Provide a quick summary of what the model is/does. -->
A RoBERTa-based model pretrained with masked language modeling (MLM) on SELFIES strings, ready to be fine-tuned on downstream tasks.

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

A RoBERTa-based model pretrained with a masked language modeling (MLM) objective on 2 million molecules written as SELFIES (SELF-referencIng Embedded Strings) and tokenized with byte-pair encoding (BPE).

- **Developed by:** Miguelangel Leon Mayuare
- **Funded by:** This work was supported by national funds through FCT (Fundação para a Ciência e a Tecnologia), under the project UIDB/04152/2020 (DOI: 10.54499/UIDB/04152/2020), Centro de Investigação em Gestão de Informação (MagIC)/NOVA IMS. Aleš Popovič was supported by the Slovenian Research and Innovation Agency (ARIS) under research core funding P2-0442.
- **Shared by:** Miguelangel Leon Mayuare
- **Model type:** RoBERTa-based
- **Language(s) (NLP):** SELFIES
- **License:** MIT

### Model Sources

<!-- Provide the basic links for the model. -->

- **Paper:** Under review

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
The model is intended for fine-tuning on downstream tasks where SELFIES is the main input.

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

The model can be directly used for the classification of chemical compounds and prediction of molecular properties using SELFIES representations.

### Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

The model can be fine-tuned for specific tasks such as drug discovery, toxicity prediction, and other cheminformatics applications using specific datasets.

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

The model should not be used for tasks outside cheminformatics, or without proper validation for the specific task. Misuse includes using the model to generate invalid chemical compounds or to make predictions outside the domain of the training data.

The model only works with SELFIES input. For SMILES-based models, see the other models in the mikemayuare repository.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

The model may inherit biases from the training data. Limitations include potential overfitting to the pre-training tasks and resource intensity for training and fine-tuning.

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

The model was pretrained on 2 million SELFIES to mitigate misrepresentation (over- and under-representation) of any type of molecule. Validating on known datasets for the downstream task is the best way to assess its limitations.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mikemayuare/SELFYBPE")
model = AutoModel.from_pretrained("mikemayuare/SELFYBPE")
```

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The training data comprised 2 million molecules from the PubChem dataset. SMILES strings were converted to SELFIES using the selfies library.

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

The models were pre-trained for 20 epochs using the AdamW optimizer on an NVIDIA 3060 GPU with 12GiB of VRAM.

#### Preprocessing

SMILES strings were converted to SELFIES using the selfies library, and tokenizers were trained on a subset of 1 million molecules from the PubChem dataset.
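To give a feel for what training a BPE tokenizer on SELFIES involves, here is a minimal, standard-library sketch. It splits each string into bracketed SELFIES symbols and repeatedly merges the most frequent adjacent pair. The toy corpus, the function names, and the choice of bracketed symbols as the base unit are assumptions made for illustration; the tokenizer shipped with this model may start from a different base alphabet and was not produced by this code.

```python
import re
from collections import Counter

# One SELFIES symbol is a bracketed token such as "[C]" or "[=O]".
SYMBOL = re.compile(r"\[[^\]]*\]")


def split_symbols(selfies: str) -> list[str]:
    """Split a SELFIES string into its bracketed symbols."""
    return SYMBOL.findall(selfies)


def merge_pair(tokens: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of SELFIES strings."""
    sequences = [split_symbols(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        sequences = [merge_pair(seq, best) for seq in sequences]
    return merges


# Hypothetical toy corpus: ethanol, propane, and acetic acid as SELFIES.
toy_corpus = ["[C][C][O]", "[C][C][C]", "[C][C][=Branch1][C][=O][O]"]
merges = learn_bpe(toy_corpus, num_merges=2)
print(merges[0])  # the first merge rule learned from the toy corpus
```

On this toy corpus the first learned merge joins two adjacent `[C]` symbols, since that pair is the most frequent. A production tokenizer would also handle special tokens and a target vocabulary size.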

#### Training Hyperparameters

- **Training regime:** fp32
- **Batch size:** 32
- **Number of epochs:** 20
- **Optimizer:** AdamW

#### Speeds, Sizes, Times

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

Training time was approximately 72 hours on the specified hardware. Checkpoint sizes are approximately 500 MB each.
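As a rough sanity check, the figures reported in this card can be combined into an approximate throughput. The sketch below assumes that all 2 million molecules were used for training and that no gradient accumulation was applied; neither is stated in the card, so treat the result as an order-of-magnitude estimate only.

```python
import math

# Figures taken from this card.
num_molecules = 2_000_000
batch_size = 32
epochs = 20
training_hours = 72  # approximate

# Assumptions: every molecule is a training example, one optimizer step per batch.
steps_per_epoch = math.ceil(num_molecules / batch_size)
total_steps = steps_per_epoch * epochs
steps_per_second = total_steps / (training_hours * 3600)
sequences_per_second = steps_per_second * batch_size

print(f"{steps_per_epoch:,} steps per epoch, {total_steps:,} steps in total")
print(f"~{steps_per_second:.1f} steps/s, ~{sequences_per_second:.0f} sequences/s")
```

Under these assumptions the run amounts to 62,500 steps per epoch and 1,250,000 steps overall, or roughly 4.8 steps per second.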

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

Testing was conducted on MoleculeNet datasets, specifically BBBP, HIV, and Tox21.

#### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

Evaluation metrics were disaggregated by dataset and task type (e.g., binary classification for BBBP).

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

The primary evaluation metric was the ROC-AUC score, which is commonly used for binary classification tasks in cheminformatics (on fine-tuned models).
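For readers unfamiliar with the metric, ROC-AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition using only the standard library. The labels and scores are hypothetical, and this is not the evaluation code used for the reported results; in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used.

```python
def roc_auc(labels: list[int], scores: list[float]) -> float:
    """ROC-AUC as the fraction of positive/negative pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities for six molecules.
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7]
auc = roc_auc(y_true, y_score)
print(f"ROC-AUC: {auc:.3f}")
```

In this toy example 7 of the 9 positive/negative pairs are ranked correctly, giving a ROC-AUC of about 0.778. The pairwise loop is quadratic and only meant to make the definition concrete.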

### Results

The models tokenized with APE generally outperformed those tokenized with BPE. SMILES models showed better performance than SELFIES models in most cases.

#### Summary

The model achieved competitive performance on standard benchmarks, outperforming several baseline models in specific tasks.

## Model Examination

<!-- Relevant interpretability work for the model goes here -->

Interpretability analyses showed that models tokenized with APE preserved the chemical context better than those with BPE, leading to higher classification accuracy.

## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions were estimated using the Machine Learning Impact calculator.

- **Hardware Type:** NVIDIA 3060 GPU
- **Hours used:** 72 hours
- **Cloud Provider:** Not applicable
- **Compute Region:** Local
- **Carbon Emitted:** Approximately 50 kg CO2eq

## Technical Specifications

### Model Architecture and Objective

The model architecture is based on RoBERTa with 6 hidden layers, 768 hidden size, 1536 intermediate size, and 12 attention heads.
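The stated dimensions are enough to estimate the size of the encoder stack. The sketch below assumes the standard BERT/RoBERTa layer layout (four attention projections with biases, a two-layer feed-forward block, and two LayerNorms per layer). Embeddings and the MLM head are left out because the vocabulary size is not given in this card.

```python
# Dimensions taken from this card.
hidden_size = 768
intermediate_size = 1536
num_layers = 6
num_heads = 12

# Each attention head works on an equal slice of the hidden dimension.
head_dim = hidden_size // num_heads


def linear_params(n_in: int, n_out: int) -> int:
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


# Assumed standard layout: query, key, value, and output projections.
attention = 4 * linear_params(hidden_size, hidden_size)
# Feed-forward block: expand to the intermediate size, then project back.
feed_forward = (linear_params(hidden_size, intermediate_size)
                + linear_params(intermediate_size, hidden_size))
# Two LayerNorms per layer, each with a scale and a shift vector.
layer_norms = 2 * (2 * hidden_size)

per_layer = attention + feed_forward + layer_norms
encoder_total = num_layers * per_layer

print(f"head dimension: {head_dim}")
print(f"parameters per layer: {per_layer:,}")
print(f"encoder stack total: {encoder_total:,}")
```

Under these assumptions each head has dimension 64, each layer holds 4,727,040 parameters, and the six-layer encoder stack holds 28,362,240 parameters before embeddings.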

### Compute Infrastructure

#### Hardware

- **Type:** NVIDIA 3060 GPU
- **VRAM:** 12GiB

#### Software

- **Framework:** PyTorch
- **Libraries:** transformers, selfies, DeepChem, Optuna

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

```bibtex
@mastersthesis{leon2024chemical,
  title={Chemical Language Modeling},
  author={Miguelangel Augusto Leon Mayuare},
  year={2024},
  school={NOVA Information Management School}
}
```

**APA:**

Mayuare, M. A. L. (2024). *Chemical Language Modeling* (Master's thesis). NOVA Information Management School.

## Glossary

<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

- **SELFIES:** SELF-referencIng Embedded Strings, a string-based representation of molecules in which every string corresponds to a valid molecule.
- **SMILES:** Simplified Molecular Input Line Entry System, a notation for describing the structure of chemical species.

## More Information

For more details, refer to the accompanying paper (publication pending).

## Model Card Authors

- Miguelangel Augusto Leon Mayuare

## Model Card Contact

For inquiries, please contact migueleonm@gmail.com