---
library_name: transformers
license: mit
base_model: agentlans/deberta-v3-xsmall-zyda-2
tags:
- generated_from_trainer
model-index:
- name: deberta-v3-xsmall-zyda-2-transformed-readability-new
  results: []
---

# deberta-v3-xsmall-zyda-2-transformed-readability-new

## Model Overview

This model is a fine-tuned version of [agentlans/deberta-v3-xsmall-zyda-2](https://huggingface.co/agentlans/deberta-v3-xsmall-zyda-2) designed to predict text readability.

It achieves the following results on the evaluation set:

- Loss: 0.0273
- MSE: 0.0273

## Dataset Description

The [dataset used for training](https://huggingface.co/datasets/agentlans/readability) comprises approximately 800,000 paragraphs with corresponding readability metrics from four diverse sources:

1. HuggingFace's Fineweb-Edu
2. Ronen Eldan's TinyStories
3. Wikipedia-2023-11-embed-multilingual-v3 (English only)
4. ArXiv Abstracts-2021

- **Text Length**: 50 to 2000 characters per paragraph
- **Readability Grade**: Median of six readability metrics (Flesch-Kincaid, Gunning Fog, SMOG, Automated Readability Index, Coleman-Liau, Linsear Write)

### [Data Transformation](https://huggingface.co/datasets/agentlans/text-stats#readability-score-calculation)

- U.S. reading grade levels were transformed using the Box-Cox method (λ = 0.8766912)
- Standardization and scale inversion were applied to generate 'readability' scores
- Higher scores indicate easier readability

### Transformation Statistics

- λ (lambda) = 0.8766912
- Mean (before standardization) = 7.908629
- Standard deviation (before standardization) = 3.339119

## Usage Example

```python
import torch
import numpy as np
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Device setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer
model_name = "agentlans/deberta-v3-xsmall-zyda-2-readability"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Predict the (transformed) readability score for a single text
def predict_score(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.item()

# Convert a readability score back to a U.S. reading grade level
# (un-standardize, flip the sign, then apply the inverse Box-Cox transform)
def grade_level(y):
    lambda_, mean, sd = 0.8766912, 7.908629, 3.339119
    y_unstd = (-y) * sd + mean
    return np.power((y_unstd * lambda_ + 1), (1 / lambda_))

# Example
input_text = "The mitochondria is the powerhouse of the cell."
readability = predict_score(input_text)
grade = grade_level(readability)
print(f"Predicted score: {readability:.2f}\nGrade: {grade:.1f}")
```

## Sample Outputs

| Text | Readability | Grade |
|------|------------:|------:|
| I like to eat apples. | 2.21 | 1.6 |
| The cat is on the mat. | 2.17 | 1.7 |
| Birds are singing in the trees. | 2.05 | 2.1 |
| The sun is shining brightly today. | 1.95 | 2.5 |
| She enjoys reading books in her free time. | 1.84 | 2.9 |
| The quick brown fox jumps over the lazy dog. | 1.75 | 3.2 |
| After a long day at work, he finally relaxed with a cup of tea. | 1.21 | 5.4 |
| As the storm approached, the sky turned a deep shade of gray, casting an eerie shadow over the landscape. | 0.54 | 8.2 |
| Despite the challenges they faced, the team remained resolute in their pursuit of excellence and innovation. | -0.52 | 13.0 |
| In a world increasingly dominated by technology, the delicate balance between human connection and digital interaction has become a focal point of contemporary discourse. | -1.91 | 19.5 |
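
For reference, the forward direction of the transform described under Data Transformation can be approximated from the published constants. The sketch below assumes the standard one-parameter Box-Cox form, (x^λ − 1) / λ, followed by standardization and a sign flip; the function name `readability_score` is ours and is not part of the model's API.

```python
import numpy as np

def readability_score(grade):
    """Approximate forward transform: U.S. reading grade level -> readability score.

    Assumes the one-parameter Box-Cox form (x**lambda - 1) / lambda, then
    standardization with the card's mean/SD and a sign flip (higher = easier).
    """
    lambda_, mean, sd = 0.8766912, 7.908629, 3.339119
    y_boxcox = (np.power(grade, lambda_) - 1) / lambda_  # Box-Cox transform of the grade level
    return -(y_boxcox - mean) / sd                        # standardize, then invert the scale

# Sanity check against the sample outputs table: grade 8.2 gives roughly 0.55,
# close to the 0.54 score reported for the storm sentence above.
print(readability_score(8.2))
```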
## Training Procedure

### Hyperparameters

- Learning rate: 5e-05
- Train batch size: 64
- Eval batch size: 8
- Seed: 42
- Optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
- LR scheduler: Linear
- Number of epochs: 3.0

(An illustrative `TrainingArguments` sketch of these settings appears at the end of this card.)

### Training Results

| Training Loss | Epoch | Step | Validation Loss | MSE |
|:-------------:|:-----:|:-----:|:---------------:|:------:|
| 0.0297 | 1.0 | 13589 | 0.0302 | 0.0302 |
| 0.0249 | 2.0 | 27178 | 0.0279 | 0.0279 |
| 0.0218 | 3.0 | 40767 | 0.0273 | 0.0273 |

## Framework Versions

- Transformers: 4.46.3
- PyTorch: 2.5.1+cu124
- Datasets: 3.1.0
- Tokenizers: 0.20.3
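
For readers who want to set up a comparable run, the hyperparameters listed under Training Procedure map onto `transformers.TrainingArguments` roughly as sketched below. This is an illustrative reconstruction, not the original training script; the `output_dir` value and the closing comment about the dataset wiring are placeholders.

```python
from transformers import TrainingArguments

# Illustrative reconstruction of the listed hyperparameters (not the original training script).
training_args = TrainingArguments(
    output_dir="deberta-v3-xsmall-zyda-2-transformed-readability-new",  # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=8,
    seed=42,
    num_train_epochs=3.0,
    lr_scheduler_type="linear",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)

# Pass these arguments to a Trainer together with the base model, tokenizer,
# and the agentlans/readability dataset to run a comparable fine-tuning job.
```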