---
license: apache-2.0
datasets:
- nferruz/UR50_2021_04
tags:
- chemistry
- biology
---


### Model Description
This model card describes `protgpt2-distilled-tiny`, a distilled version of [ProtGPT2](https://huggingface.co/nferruz/ProtGPT2). The model was obtained by knowledge distillation from the larger teacher model into a smaller, more efficient student. Training combines a soft loss (knowledge-distillation loss against the teacher's temperature-softened outputs) with a hard loss (cross-entropy against the ground-truth tokens), so the student generalizes like its teacher while retaining strong standalone prediction quality.

### Technical Details
**Distillation Parameters** (a configuration sketch follows the list):
- **Temperature (T):** 10
- **Alpha (α):** 0.1
- **Model Architecture:**
  - **Number of Layers:** 6
  - **Number of Attention Heads:** 8
  - **Embedding Size:** 768
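For orientation, here is a minimal sketch of how a student with these dimensions could be instantiated with `transformers`. It is illustrative only and not the actual distillation code; the vocabulary size and context length below are placeholders, not confirmed values for this checkpoint.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Student dimensions taken from the list above; the other fields are assumptions.
student_config = GPT2Config(
    n_layer=6,         # Number of Layers
    n_head=8,          # Number of Attention Heads
    n_embd=768,        # Embedding Size
    vocab_size=50257,  # placeholder: use the teacher tokenizer's vocabulary size
    n_positions=1024,  # placeholder context length
)
student_model = GPT2LMHeadModel(student_config)
print(f"{sum(p.numel() for p in student_model.parameters()):,} parameters")
```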

**Dataset Used:**
- The model was distilled using a subset of the evaluation dataset provided by [nferruz/UR50_2021_04](https://huggingface.co/datasets/nferruz/UR50_2021_04).

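A minimal sketch of pulling such a subset with the `datasets` library is shown below; the split handling and subset size are assumptions and should be adjusted to the actual dataset layout.

```python
from datasets import load_dataset

# Load the dataset referenced above and inspect its splits and columns.
dataset = load_dataset("nferruz/UR50_2021_04")
print(dataset)

# Take a small subset of the first available split for evaluation.
first_split = list(dataset.keys())[0]
eval_subset = dataset[first_split].select(range(1000))  # e.g., first 1,000 records
```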
**Loss Formulation** (a PyTorch sketch follows the list):
- **Soft Loss:** \(\mathcal{L}_{\text{soft}} = \mathrm{KL}\big(\mathrm{softmax}(s/T) \,\|\, \mathrm{softmax}(t/T)\big)\)
- **Hard Loss:** \(\mathcal{L}_{\text{hard}} = -\sum_i y_i \log\big(\mathrm{softmax}(s_i)\big)\)
- **Combined Loss:** \(\mathcal{L} = \alpha\, \mathcal{L}_{\text{hard}} + (1 - \alpha)\, \mathcal{L}_{\text{soft}}\)

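The sketch below shows one way to compute this combined objective. It is illustrative rather than the actual training code; `student_logits`, `teacher_logits`, and `labels` are assumed tensors, and the common \(T^2\) rescaling of the soft term is omitted so the code mirrors the formulas above.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=10.0, alpha=0.1):
    """Combined objective: alpha * hard cross-entropy + (1 - alpha) * soft KL."""
    # Soft loss: KL divergence between temperature-softened distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )

    # Hard loss: cross-entropy of the student's logits against the true tokens.
    hard_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
    )

    return alpha * hard_loss + (1 - alpha) * soft_loss
```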
### Performance
The distilled model, `protgpt2-distilled-tiny`, runs inference up to 6 times faster than the pretrained ProtGPT2. In evaluations over \(n = 5\) runs, it achieves this speedup while maintaining perplexities comparable to the original model.

![Evals](https://images.mobilism.org/?di=PYFQ1N5V)


![Loss](https://images.mobilism.org/?di=LPUY)

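A rough way to sanity-check the perplexity side of this comparison is sketched below. It is not the original evaluation script; the example sequence, single forward pass, and naive timing are assumptions rather than the benchmark protocol used above.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, sequence):
    # Perplexity = exp(mean negative log-likelihood of the sequence tokens).
    enc = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        loss = model(enc["input_ids"], labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

example_sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example sequence

for name in ["nferruz/ProtGPT2", "littleworth/protgpt2-distilled-tiny"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    start = time.time()
    ppl = perplexity(model, tokenizer, example_sequence)
    elapsed = time.time() - start  # rough single-pass timing, not the n=5 benchmark
    print(f"{name}: perplexity={ppl:.1f}, forward-pass time={elapsed:.2f}s")
```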
### Usage

```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TextGenerationPipeline

# Load the model and tokenizer
model_name = "littleworth/protgpt2-distilled-tiny"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Initialize the pipeline (device=0 selects the first GPU; use device=-1 for CPU)
text_generator = TextGenerationPipeline(
    model=model, tokenizer=tokenizer, device=0
)

# Generate sequences
generated_sequences = text_generator(
    "<|endoftext|>",
    max_length=100,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=10,
    pad_token_id=tokenizer.eos_token_id,  # set pad_token_id to eos_token_id
    eos_token_id=0,
    truncation=True,
)


def clean_sequence(text):
    # Remove the "<|endoftext|>" token
    text = text.replace("<|endoftext|>", "")

    # Remove newline characters and any other non-alphabetical characters
    text = "".join(char for char in text if char.isalpha())

    return text


# Print the generated sequences in FASTA-like format
for i, seq in enumerate(generated_sequences):
    cleaned_text = clean_sequence(seq["generated_text"])
    print(f">Seq_{i}")
    print(cleaned_text)
```

### Use Cases
1. **High-Throughput Screening in Drug Discovery:** The distilled model supports rapid mutation screening by efficiently scoring protein variants (for example, using sequence likelihood as a proxy for variant plausibility). Its reduced size also allows fast fine-tuning on new datasets, speeding up target identification.
2. **Portable Diagnostics in Healthcare:** The small footprint makes the model practical to run on portable, resource-constrained devices, enabling near real-time protein sequence analysis in remote clinical settings.
3. **Interactive Learning Tools in Academia:** Integrated into educational software, the distilled model lets biology students simulate and explore protein sequence generation without advanced computational resources.

### References
- Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531.
- Ferruz, N., Schmidt, S., & Höcker, B. (2022). ProtGPT2 is a deep unsupervised language model for protein design. Nature Communications. [Link to paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9329459/)