metadata
library_name: transformers
license: mit
base_model: agentlans/deberta-v3-xsmall-zyda-2
tags:
  - generated_from_trainer
model-index:
  - name: deberta-v3-xsmall-zyda-2-transformed-readability-new
    results: []

deberta-v3-xsmall-zyda-2-transformed-readability-new

Model Overview

This model is a fine-tuned version of agentlans/deberta-v3-xsmall-zyda-2 that predicts text readability as a single continuous score. It achieves the following results on the evaluation set:

  • Loss: 0.0273
  • MSE: 0.0273

Dataset Description

The dataset used for training comprises approximately 800 000 paragraphs with corresponding readability metrics from four diverse sources:

  1. HuggingFace's Fineweb-Edu
  2. Ronen Eldan's TinyStories
  3. Wikipedia-2023-11-embed-multilingual-v3 (English only)
  4. ArXiv Abstracts-2021

Each paragraph has the following properties:

  • Text length: 50 to 2000 characters per paragraph
  • Readability grade: median of six readability metrics (Flesch-Kincaid, Gunning Fog, SMOG, Automated Readability Index, Coleman-Liau, Linsear Write)
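The two paragraph-level rules above can be sketched as follows. This is a minimal illustration, not the pipeline used to build the dataset: the helper names, the inclusive length bounds, and the example metric values are all assumptions, and the six per-metric grades are taken as already computed by some readability tool.

```python
from statistics import median

# Length bounds stated above; treating them as inclusive is an assumption
MIN_CHARS, MAX_CHARS = 50, 2000


def within_length_bounds(text: str) -> bool:
    """Return True if the paragraph has between 50 and 2000 characters."""
    return MIN_CHARS <= len(text) <= MAX_CHARS


def median_grade(metric_grades: dict[str, float]) -> float:
    """Combine per-metric U.S. grade levels into one grade via the median."""
    return median(metric_grades.values())


# Hypothetical metric values for one paragraph (illustrative numbers only)
example = {
    "flesch_kincaid": 8.1,
    "gunning_fog": 10.4,
    "smog": 9.7,
    "automated_readability_index": 8.9,
    "coleman_liau": 9.2,
    "linsear_write": 7.5,
}

# With six values, the median is the average of the two middle values
print(median_grade(example))
```

Using the median rather than the mean keeps a single outlying metric from dominating the combined grade.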

Data Transformation

  • U.S. reading grade levels were transformed using the Box-Cox method (λ = 0.8766912)
  • The transformed values were standardized (mean subtracted, then divided by the standard deviation) and negated to produce the 'readability' score
  • Higher scores therefore indicate text that is easier to read

Transformation Statistics

  • λ (lambda) = 0.8766912
  • Mean of the Box-Cox-transformed grades (before standardization) = 7.908629
  • Standard deviation of the Box-Cox-transformed grades (before standardization) = 3.339119
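For reference, the forward transformation from a grade level to a readability score can be sketched as below. The function name `readability_from_grade` is hypothetical, and the exact order of operations is inferred from the statistics above and from the inverse conversion in the usage example, so treat this as an approximation of the preprocessing rather than the original code.

```python
import numpy as np

LAMBDA = 0.8766912  # Box-Cox λ
MEAN = 7.908629     # mean of the Box-Cox-transformed grades
SD = 3.339119       # standard deviation of the Box-Cox-transformed grades


def readability_from_grade(grade):
    """Map a positive U.S. grade level to a readability score."""
    grade = np.asarray(grade, dtype=float)
    # Box-Cox transform with fixed λ
    transformed = (np.power(grade, LAMBDA) - 1.0) / LAMBDA
    # Standardize using the statistics listed above
    standardized = (transformed - MEAN) / SD
    # Flip the sign so that easier text receives a higher score
    return -standardized


# Lower grades map to higher readability scores
print(readability_from_grade([2.0, 8.0, 16.0]))
```

Because the sign is flipped in the last step, a score of 0 corresponds to the average transformed grade in the training data, positive scores to easier text, and negative scores to harder text.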

Usage Example

import torch
import numpy as np
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Device setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer
model_name = "agentlans/deberta-v3-xsmall-zyda-2-readability"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prediction function
def predict_score(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.item()

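# Grade level conversion function: maps a readability score back to a U.S. grade level
def grade_level(y):
    lambda_, mean, sd = 0.8766912, 7.908629, 3.339119
    # Undo the sign flip and the standardization
    y_unstd = (-y) * sd + mean
    # Invert the Box-Cox transform
    return np.power((y_unstd * lambda_ + 1), (1 / lambda_))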

# Example
input_text = "The mitochondria is the powerhouse of the cell."
readability = predict_score(input_text)
grade = grade_level(readability)
print(f"Predicted score: {readability:.2f}\nGrade: {grade:.1f}")

Sample Outputs

Text | Readability | Grade
I like to eat apples. | 2.21 | 1.6
The cat is on the mat. | 2.17 | 1.7
Birds are singing in the trees. | 2.05 | 2.1
The sun is shining brightly today. | 1.95 | 2.5
She enjoys reading books in her free time. | 1.84 | 2.9
The quick brown fox jumps over the lazy dog. | 1.75 | 3.2
After a long day at work, he finally relaxed with a cup of tea. | 1.21 | 5.4
As the storm approached, the sky turned a deep shade of gray, casting an eerie shadow over the landscape. | 0.54 | 8.2
Despite the challenges they faced, the team remained resolute in their pursuit of excellence and innovation. | -0.52 | 13.0
In a world increasingly dominated by technology, the delicate balance between human connection and digital interaction has become a focal point of contemporary discourse. | -1.91 | 19.5

Training Procedure

Hyperparameters

  • Learning rate: 5e-05
  • Train batch size: 64
  • Eval batch size: 8
  • Seed: 42
  • Optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
  • LR scheduler: Linear
  • Number of epochs: 3.0

Training Results

Training Loss | Epoch | Step | Validation Loss | MSE
0.0297 | 1.0 | 13589 | 0.0302 | 0.0302
0.0249 | 2.0 | 27178 | 0.0279 | 0.0279
0.0218 | 3.0 | 40767 | 0.0273 | 0.0273
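To make the validation MSE easier to interpret, it can be converted to a root-mean-square error. The sketch below assumes the reported MSE is measured on the standardized readability score that the model outputs; under that assumption, multiplying the RMSE by the standard deviation from the transformation statistics gives the error in Box-Cox units. The helper name `rmse_from_mse` is illustrative.

```python
import math

# Standard deviation of the Box-Cox-transformed grades (Transformation Statistics)
SD = 3.339119

# Validation MSE per epoch, copied from the table above
validation_mse = {1: 0.0302, 2: 0.0279, 3: 0.0273}


def rmse_from_mse(mse: float) -> float:
    """Root-mean-square error corresponding to a mean squared error."""
    return math.sqrt(mse)


for epoch, mse in validation_mse.items():
    rmse = rmse_from_mse(mse)
    # RMSE in score units, then rescaled to Box-Cox units
    print(f"epoch {epoch}: RMSE = {rmse:.3f} (score), {rmse * SD:.3f} (Box-Cox units)")
```

Under this assumption the final RMSE is roughly 0.165 score units, that is, about one sixth of a standard deviation of the transformed grades.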

Framework Versions

  • Transformers: 4.46.3
  • PyTorch: 2.5.1+cu124
  • Datasets: 3.1.0
  • Tokenizers: 0.20.3