---
library_name: transformers
license: mit
base_model: agentlans/deberta-v3-xsmall-zyda-2
tags:
- generated_from_trainer
model-index:
- name: deberta-v3-xsmall-zyda-2-transformed-readability-new
  results: []
---
|
|
|
# deberta-v3-xsmall-zyda-2-transformed-readability-new

## Model Overview

This model is a fine-tuned version of [agentlans/deberta-v3-xsmall-zyda-2](https://huggingface.co/agentlans/deberta-v3-xsmall-zyda-2) designed to predict text readability. It achieves the following results on the evaluation set:

- Loss: 0.0273
- MSE: 0.0273
|
|
|
## Dataset Description

The [dataset used for training](https://huggingface.co/datasets/agentlans/readability) comprises approximately 800,000 paragraphs with corresponding readability metrics from four diverse sources:

1. HuggingFace's Fineweb-Edu
2. Ronen Eldan's TinyStories
3. Wikipedia-2023-11-embed-multilingual-v3 (English only)
4. ArXiv Abstracts-2021

- **Text Length**: 50 to 2,000 characters per paragraph
- **Readability Grade**: Median of six readability metrics (Flesch-Kincaid, Gunning Fog, SMOG, Automated Readability Index, Coleman-Liau, Linsear Write)
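
For reference, one way to approximate such a per-paragraph grade is to take the median of those six metrics as computed by the `textstat` package. The sketch below is illustrative only; the dataset card documents the actual calculation, and `textstat` is an assumption here, not necessarily the tool that was used.

```python
from statistics import median

import textstat  # pip install textstat

def median_grade(text: str) -> float:
    """Median of six U.S. grade-level readability metrics (illustrative approximation)."""
    scores = [
        textstat.flesch_kincaid_grade(text),
        textstat.gunning_fog(text),
        textstat.smog_index(text),
        textstat.automated_readability_index(text),
        textstat.coleman_liau_index(text),
        textstat.linsear_write_formula(text),
    ]
    return median(scores)

print(median_grade("The cat is on the mat."))
```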
|
|
|
### [Data Transformation](https://huggingface.co/datasets/agentlans/text-stats#readability-score-calculation)

- U.S. reading grade levels were transformed using the Box-Cox method (λ = 0.8766912)
- Standardization and scale inversion were applied to generate 'readability' scores
- Higher scores indicate easier readability

### Transformation Statistics

- λ (lambda) = 0.8766912
- Mean (before standardization) = 7.908629
- Standard deviation (before standardization) = 3.339119
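
These statistics are enough to reproduce the forward mapping from a raw U.S. grade level to the model's target scale: apply the Box-Cox transform, standardize, and flip the sign. The sketch below composes those three steps under a hypothetical helper name (`readability_target`); it is the exact inverse of the `grade_level` function in the usage example below.

```python
import numpy as np

# Forward transform: U.S. reading grade level -> transformed readability score.
# `readability_target` is a hypothetical name; the steps mirror the description above.
def readability_target(grade, lambda_=0.8766912, mean=7.908629, sd=3.339119):
    boxcox = (np.power(grade, lambda_) - 1) / lambda_  # Box-Cox transform
    return -(boxcox - mean) / sd                       # standardize, then invert the scale

print(readability_target(8.24))  # roughly 0.54, consistent with the sample outputs below
```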
|
|
|
## Usage Example

```python
import torch
import numpy as np
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Device setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer
model_name = "agentlans/deberta-v3-xsmall-zyda-2-readability"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Predict the transformed readability score for a single text (higher = easier to read)
def predict_score(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.item()

# Convert a predicted score back to a U.S. reading grade level by undoing
# the scale inversion, standardization, and Box-Cox transformation
def grade_level(y):
    lambda_, mean, sd = 0.8766912, 7.908629, 3.339119
    y_unstd = (-y) * sd + mean
    return np.power((y_unstd * lambda_ + 1), (1 / lambda_))

# Example
input_text = "The mitochondria is the powerhouse of the cell."
readability = predict_score(input_text)
grade = grade_level(readability)
print(f"Predicted score: {readability:.2f}\nGrade: {grade:.1f}")
```
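
Note that `predict_score` handles one text at a time (it calls `logits.item()`). For scoring many paragraphs, a batched variant along the following lines may be more convenient; this is a sketch that reuses the `model`, `tokenizer`, `device`, and `grade_level` defined above.

```python
def predict_scores(texts, batch_size=32):
    """Score a list of texts in batches, returning one readability score per text."""
    scores = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True).to(device)
        with torch.no_grad():
            logits = model(**inputs).logits  # shape: (len(batch), 1)
        scores.extend(logits.squeeze(-1).tolist())
    return scores

texts = ["I like to eat apples.", "The quick brown fox jumps over the lazy dog."]
for text, score in zip(texts, predict_scores(texts)):
    print(f"{score:.2f}\t{grade_level(score):.1f}\t{text}")
```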
|
|
|
## Sample Outputs

| Text | Readability score | U.S. grade level |
|------|------------------:|-----------------:|
| I like to eat apples. | 2.21 | 1.6 |
| The cat is on the mat. | 2.17 | 1.7 |
| Birds are singing in the trees. | 2.05 | 2.1 |
| The sun is shining brightly today. | 1.95 | 2.5 |
| She enjoys reading books in her free time. | 1.84 | 2.9 |
| The quick brown fox jumps over the lazy dog. | 1.75 | 3.2 |
| After a long day at work, he finally relaxed with a cup of tea. | 1.21 | 5.4 |
| As the storm approached, the sky turned a deep shade of gray, casting an eerie shadow over the landscape. | 0.54 | 8.2 |
| Despite the challenges they faced, the team remained resolute in their pursuit of excellence and innovation. | -0.52 | 13.0 |
| In a world increasingly dominated by technology, the delicate balance between human connection and digital interaction has become a focal point of contemporary discourse. | -1.91 | 19.5 |
|
|
|
## Training Procedure

### Hyperparameters

- Learning rate: 5e-05
- Train batch size: 64
- Eval batch size: 8
- Seed: 42
- Optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
- LR scheduler: Linear
- Number of epochs: 3.0
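
These are standard Hugging Face Trainer settings for a single-output regression head (MSE loss). The sketch below shows how they might map onto `TrainingArguments`; the dataset preparation (`train_ds` and `eval_ds` with a float `labels` column) is an assumed placeholder for illustration, not the actual training script.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

base = "agentlans/deberta-v3-xsmall-zyda-2"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=1)  # regression head -> MSE loss

args = TrainingArguments(
    output_dir="deberta-v3-xsmall-zyda-2-transformed-readability-new",
    learning_rate=5e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=8,
    num_train_epochs=3.0,
    lr_scheduler_type="linear",
    seed=42,
    eval_strategy="epoch",  # AdamW with the listed betas/epsilon is the default optimizer
)

# train_ds / eval_ds: tokenized datasets with a float "labels" column (assumed, not shown)
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds, tokenizer=tokenizer)
trainer.train()
```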
|
|
|
### Training Results

| Training Loss | Epoch | Step | Validation Loss | MSE |
|:-------------:|:-----:|:-----:|:---------------:|:------:|
| 0.0297 | 1.0 | 13589 | 0.0302 | 0.0302 |
| 0.0249 | 2.0 | 27178 | 0.0279 | 0.0279 |
| 0.0218 | 3.0 | 40767 | 0.0273 | 0.0273 |
|
|
|
## Framework Versions

- Transformers: 4.46.3
- PyTorch: 2.5.1+cu124
- Datasets: 3.1.0
- Tokenizers: 0.20.3
|
|