---
library_name: transformers
license: mit
base_model: agentlans/deberta-v3-xsmall-zyda-2
tags:
- generated_from_trainer
model-index:
- name: deberta-v3-xsmall-zyda-2-transformed-readability-new
results: []
---
# deberta-v3-xsmall-zyda-2-transformed-readability-new
## Model Overview
This model is a fine-tuned version of [agentlans/deberta-v3-xsmall-zyda-2](https://huggingface.co/agentlans/deberta-v3-xsmall-zyda-2) designed to predict text readability. It achieves the following results on the evaluation set:
- Loss: 0.0273
- MSE: 0.0273
## Dataset Description
The [dataset used for training](https://huggingface.co/datasets/agentlans/readability) comprises approximately 800,000 paragraphs with corresponding readability metrics drawn from four diverse sources:
1. HuggingFace's Fineweb-Edu
2. Ronen Eldan's TinyStories
3. Wikipedia-2023-11-embed-multilingual-v3 (English only)
4. ArXiv Abstracts-2021
- **Text Length**: 50 to 2000 characters per paragraph
- **Readability Grade**: Median of six readability metrics (Flesch-Kincaid, Gunning Fog, SMOG, Automated Readability Index, Coleman-Liau, Linsear Write)
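
For a quick look at the training data, the dataset can be pulled from the Hub with the `datasets` library. The split and field names below are assumptions; consult the dataset card for the exact schema.

```python
from datasets import load_dataset

# Sketch: inspect the readability dataset referenced above.
# Split and column names are assumptions and may differ from the published dataset.
ds = load_dataset("agentlans/readability", split="train")
print(ds)     # column names and row count
print(ds[0])  # one paragraph with its readability fields
```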
### [Data Transformation](https://huggingface.co/datasets/agentlans/text-stats#readability-score-calculation)
- U.S. reading grade levels were transformed using the Box-Cox method (λ = 0.8766912)
- Standardization and scale inversion were applied to generate 'readability' scores
- Higher scores indicate easier readability
### Transformation Statistics
- λ (lambda) = 0.8766912
- Mean (before standardization) = 7.908629
- Standard deviation (before standardization) = 3.339119
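
Putting these statistics together, the forward transform from a U.S. grade level to the model's readability target can be sketched as below. This is inferred from the statistics above and the inverse used in the usage example, not code taken from the training pipeline.

```python
import numpy as np

LAMBDA, MEAN, SD = 0.8766912, 7.908629, 3.339119

def readability_from_grade(grade):
    """Assumed forward transform: Box-Cox, standardize, then flip the sign."""
    boxcox = (np.power(grade, LAMBDA) - 1) / LAMBDA  # Box-Cox with the lambda above
    return -(boxcox - MEAN) / SD                     # standardize, negate so higher = easier

print(readability_from_grade(8.0))  # a grade-8 text maps to a score near 0.6
```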
## Usage Example
```python
import torch
import numpy as np
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Device setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer
model_name = "agentlans/deberta-v3-xsmall-zyda-2-readability"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.eval()

# Prediction function: returns the raw readability score for a single text
def predict_score(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.item()

# Grade level conversion: inverts the standardization and Box-Cox transform described above
def grade_level(y):
    lambda_, mean, sd = 0.8766912, 7.908629, 3.339119
    y_unstd = (-y) * sd + mean                            # undo sign flip and standardization
    return np.power(y_unstd * lambda_ + 1, 1 / lambda_)   # inverse Box-Cox back to U.S. grade level

# Example
input_text = "The mitochondria is the powerhouse of the cell."
readability = predict_score(input_text)
grade = grade_level(readability)
print(f"Predicted score: {readability:.2f}\nGrade: {grade:.1f}")
```
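
To score several texts at once (for instance the sentences in the Sample Outputs table below), a batched variant of `predict_score` is a small change. This sketch reuses the `model`, `tokenizer`, `device`, and `grade_level` defined above.

```python
def predict_scores(texts, batch_size=32):
    """Sketch of batched scoring; reuses model, tokenizer, device, and grade_level from above."""
    scores = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True).to(device)
        with torch.no_grad():
            logits = model(**inputs).logits.squeeze(-1)  # one score per text
        scores.extend(logits.cpu().tolist())
    return scores

texts = ["I like to eat apples.", "The quick brown fox jumps over the lazy dog."]
for text, score in zip(texts, predict_scores(texts)):
    print(f"{score:.2f}\t{grade_level(score):.1f}\t{text}")
```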
## Sample Outputs
| Text | Readability | Grade |
|------|------------:|------:|
| I like to eat apples. | 2.21 | 1.6 |
| The cat is on the mat. | 2.17 | 1.7 |
| Birds are singing in the trees. | 2.05 | 2.1 |
| The sun is shining brightly today. | 1.95 | 2.5 |
| She enjoys reading books in her free time. | 1.84 | 2.9 |
| The quick brown fox jumps over the lazy dog. | 1.75 | 3.2 |
| After a long day at work, he finally relaxed with a cup of tea. | 1.21 | 5.4 |
| As the storm approached, the sky turned a deep shade of gray, casting an eerie shadow over the landscape. | 0.54 | 8.2 |
| Despite the challenges they faced, the team remained resolute in their pursuit of excellence and innovation. | -0.52 | 13.0 |
| In a world increasingly dominated by technology, the delicate balance between human connection and digital interaction has become a focal point of contemporary discourse. | -1.91 | 19.5 |
## Training Procedure
### Hyperparameters
- Learning rate: 5e-05
- Train batch size: 64
- Eval batch size: 8
- Seed: 42
- Optimizer: AdamW (betas=(0.9,0.999), epsilon=1e-08)
- LR scheduler: Linear
- Number of epochs: 3.0
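
These settings correspond roughly to the following `transformers.TrainingArguments`. The original training script is not part of this card, so treat this as an assumed reconstruction rather than the exact configuration used.

```python
from transformers import TrainingArguments

# Assumed mapping of the listed hyperparameters onto TrainingArguments.
training_args = TrainingArguments(
    output_dir="deberta-v3-xsmall-zyda-2-transformed-readability-new",
    learning_rate=5e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=3.0,
    optim="adamw_torch",  # AdamW with default betas=(0.9, 0.999), eps=1e-8
)
```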
### Training Results
| Training Loss | Epoch | Step | Validation Loss | MSE |
|:-------------:|:-----:|:-----:|:---------------:|:------:|
| 0.0297 | 1.0 | 13589 | 0.0302 | 0.0302 |
| 0.0249 | 2.0 | 27178 | 0.0279 | 0.0279 |
| 0.0218 | 3.0 | 40767 | 0.0273 | 0.0273 |
## Framework Versions
- Transformers: 4.46.3
- PyTorch: 2.5.1+cu124
- Datasets: 3.1.0
- Tokenizers: 0.20.3