---
license: apache-2.0
language:
- ru
base_model:
- ai-forever/ruBert-large
tags:
- difficulty
- cefr
- regression
---

# Model Card: Russian Text Difficulty (CEFR) Regressor

A regression model that predicts a difficulty score for an input Russian text. Predicted scores can be mapped to CEFR levels.

## Model Details

Frozen BERT-large layers with a regression head on top. Trained on a mix of manually annotated datasets (more details on the data will follow).
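
Since the encoder layers are frozen, only the regression head is updated during training. A minimal sketch of what the freezing could look like, assuming the `CustomModel` class shown in the next section (the helper name `freeze_encoder` is illustrative, not part of this repository):

```python
import torch.nn as nn


def freeze_encoder(model: nn.Module) -> None:
    """Freeze the BERT encoder so only the regression head receives gradient updates.

    Assumes `model` has a `.bert` submodule, as in the `CustomModel` class below.
    """
    for param in model.bert.parameters():
        param.requires_grad = False
```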

## How to Get Started with the Model

Use the code below to get started with the model. The checkpoint uses a custom regression head, so the `CustomModel` class must be defined before the weights are loaded.

```python
import torch
import torch.nn as nn
from transformers import AutoConfig, AutoTokenizer, BertModel, BertPreTrainedModel


class CustomModel(BertPreTrainedModel):
    """ruBert-large encoder with a small regression head that predicts a difficulty score."""

    def __init__(self, config, load_path=None, use_auth_token: str = None):
        super().__init__(config)
        self.bert = BertModel(config)
        self.pre_classifier = nn.Linear(config.hidden_size, 128)
        self.dropout = nn.Dropout(0.2)
        self.classifier = nn.Linear(128, 1)

        # Apply Xavier initialization to the regression head
        nn.init.xavier_uniform_(self.pre_classifier.weight)
        nn.init.xavier_uniform_(self.classifier.weight)
        if self.pre_classifier.bias is not None:
            nn.init.constant_(self.pre_classifier.bias, 0)
        if self.classifier.bias is not None:
            nn.init.constant_(self.classifier.bias, 0)

    def forward(
        self,
        input_ids,
        labels=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
    ):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
        )

        # Take the [CLS] token representation and pass it through the regression head
        pooled_output = outputs[0][:, 0]
        pooled_output = self.pre_classifier(pooled_output)
        pooled_output = nn.ReLU()(pooled_output)
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        if labels is not None:
            loss_fn = nn.MSELoss()
            loss = loss_fn(logits.view(-1), labels.view(-1))
            return loss, logits
        return None, logits


model_path = "<path-to-this-model>"  # local checkpoint directory or Hub repo id
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
config.num_labels = 1

model = CustomModel(config)
model.load_state_dict(torch.load(f"{model_path}/pytorch_model.bin", map_location=device))
model.to(device)
model.eval()

text = "Пример текста, сложность которого нужно оценить."  # example input text
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
inputs = {key: value.to(device) for key, value in inputs.items()}

with torch.no_grad():
    _, logits = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        token_type_ids=inputs["token_type_ids"],
    )
```

To map a predicted score to a CEFR level, use:

```python
# Map a rounded difficulty score to a CEFR level
reg2cl2 = {
    '0.0': 'A1', '1.0': 'A1', '1.5': 'A12', '2.0': 'A2', '2.5': 'A2',
    '3.0': 'B1', '3.5': 'B12', '4.0': 'B2', '4.5': 'B2',
    '5.0': 'C1', '5.5': 'C12', '6.0': 'C2',
}

print("Predicted output (logits):", logits.item(), reg2cl2[str(float(round(logits.item())))])
```
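
Note that `round(logits.item())` snaps the score to the nearest whole level: for example, a predicted score of 3.4 rounds to 3.0 and maps to B1. The half-step keys (e.g. '3.5' for the intermediate label 'B12') are only reached if you round to the nearest 0.5 instead, for instance with `round(logits.item() * 2) / 2`.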

## Training Details

#### Training Hyperparameters

+ learning_rate: 3e-4
+ num_train_epochs: 15.0
+ batch_size: 32
+ weight_decay: 0.1
+ adam_beta1: 0.9
+ adam_beta2: 0.99
+ adam_epsilon: 1e-8
+ max_grad_norm: 1.0
+ fp16: True
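
These values map directly onto Hugging Face `TrainingArguments`. Below is a minimal sketch of how they could be passed to a `Trainer`; the output directory and the `train_dataset` / `eval_dataset` objects are hypothetical, and this is not the original training script.

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="difficulty-regressor",  # hypothetical output directory
    learning_rate=3e-4,
    num_train_epochs=15.0,
    per_device_train_batch_size=32,
    weight_decay=0.1,
    adam_beta1=0.9,
    adam_beta2=0.99,
    adam_epsilon=1e-8,
    max_grad_norm=1.0,
    fp16=True,
)

# `model` and `tokenizer` are assumed to be set up as in the snippet above;
# the datasets are assumed to provide float difficulty scores as `labels`.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```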

## Evaluation on the Test Set

![Evaluation results on the test set](https://cdn-uploads.huggingface.co/production/uploads/62798ccd9cda8a29dd8aef9e/AjnlZv3q0VZZpO7NNJ7eG.png)

## Citation

Please refer to this repository when using the model.