---
license: apache-2.0
library_name: transformers
pipeline_tag: feature-extraction
tags:
- chemistry
---

# selfies-ted

selfies-ted is a project for encoding SMILES (Simplified Molecular Input Line Entry System) strings into SELFIES (SELF-referencing Embedded Strings) and generating embeddings for molecular representations.

![selfies-ted](selfies-ted.png)

## Model Architecture

Configuration details:

- Encoder and decoder FFN dimension: 256
- Number of attention heads: 4
- Number of encoder and decoder layers: 2
- Total number of hidden layers: 6
- Maximum position embeddings: 128
- Model dimension (d_model): 256

A configuration sketch based on these values appears under "Implementation Sketches" at the end of this README.

## Pretrained Models and Training Logs

We provide checkpoints of the selfies-ted model pre-trained on a dataset of molecules curated from PubChem. The pre-trained model shows competitive performance on molecular representation tasks. Model weights are available at the "HuggingFace link".

To install and use the pre-trained model:

1. Download the `selfies_ted_model.pkl` file from the "HuggingFace link".
2. Place `selfies_ted_model.pkl` in the `models/` directory.

The directory structure should look like the following:

```
models/
└── selfies_ted_model.pkl
```

## Installation

To use this project, you'll need to install the required dependencies. We recommend using a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```

Install the required dependencies:

```bash
pip install -r requirements.txt
```

## Usage

### Import

```python
import load
```

### Training the Model

To train the model, use the `train.py` script:

```bash
python train.py -f
```

Note: actual usage may depend on the specific implementation in `load.py`. Please refer to the source code for detailed functionality.

### Load the model and tokenizer

```python
load.load("path/to/checkpoint.pkl")
```

### Encode SMILES strings

```python
smiles_list = ["COC", "CCO"]
embeddings = load.encode(smiles_list)
```

An end-to-end sketch of this encoding step appears under "Implementation Sketches" below.

## Example Notebook

An example notebook for this project is provided in `selfies-ted-example.ipynb`.
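## Implementation Sketches

For reference, the configuration listed under "Model Architecture" could be expressed with the Hugging Face `transformers` API roughly as follows. This is a minimal sketch assuming a BART-style encoder-decoder; the use of `BartConfig` and the exact parameter mapping are assumptions, not the project's actual configuration code.

```python
from transformers import BartConfig

# Assumed mapping of the values listed under "Model Architecture";
# the project's own configuration code may differ.
config = BartConfig(
    d_model=256,                  # model dimension
    encoder_ffn_dim=256,          # encoder FFN dimension
    decoder_ffn_dim=256,          # decoder FFN dimension
    encoder_attention_heads=4,    # attention heads per encoder layer
    decoder_attention_heads=4,    # attention heads per decoder layer
    encoder_layers=2,             # number of encoder layers
    decoder_layers=2,             # number of decoder layers
    max_position_embeddings=128,  # maximum position embeddings
)
print(config)
```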
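The encoding step under "Usage" converts SMILES to SELFIES and then embeds the SELFIES strings. Below is a minimal end-to-end sketch of that pipeline that does not depend on the project's `load` module. It assumes the `selfies` package for the SMILES-to-SELFIES conversion, a BART-style checkpoint identified by the placeholder `model_name_or_path`, and mean pooling of the encoder hidden states as one plausible way to obtain a fixed-size embedding per molecule.

```python
import selfies as sf
import torch
from transformers import AutoModel, AutoTokenizer

# Convert SMILES to SELFIES with the selfies package.
smiles_list = ["COC", "CCO"]
selfies_list = [sf.encoder(smi) for smi in smiles_list]  # e.g. "[C][O][C]"

# "model_name_or_path" is a placeholder; substitute the actual selfies-ted checkpoint.
tokenizer = AutoTokenizer.from_pretrained("model_name_or_path")
model = AutoModel.from_pretrained("model_name_or_path")  # assumed BART-style encoder-decoder
model.eval()

inputs = tokenizer(selfies_list, return_tensors="pt", padding=True)
with torch.no_grad():
    # Run only the encoder and mean-pool token states into one vector per molecule.
    hidden = model.get_encoder()(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(embeddings.shape)  # (number of molecules, d_model)
```

Mean pooling is only one choice here; the actual `load.encode` may pool the encoder states differently.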