This is a tiny BERT model for Bashkir, intended for fixing OCR errors.

Here is the code to run it (it uses a custom tokenizer whose code is downloaded from the model repo at runtime, hence trust_remote_code=True):

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = 'slone/bert-tiny-char-ctc-bak-denoise'
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
# trust_remote_code=True lets the custom character-level tokenizer load from the model repo
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True, revision='194109')

def fix_text(text, spaces=2):
    # `spaces` is how many special characters the tokenizer inserts between input characters
    with torch.inference_mode():
        batch = tokenizer(
            text, return_tensors='pt', spaces=spaces,
            padding=True, truncation=True, return_token_type_ids=False,
        ).to(model.device)
        log_probs = torch.log_softmax(model(**batch).logits, dim=-1)
    # greedy decoding: take the best character per position, then drop the special characters
    return tokenizer.decode(log_probs[0].argmax(-1), skip_special_tokens=True)

print(fix_text("Э Ҡаратау ҙы белмәйем."))
# Ә Ҡаратауҙы белмәйем.

The model works by:

  • inserting special characters ("spaces") between the input characters,
  • performing per-token classification (for most positions the predicted character equals the input character, but some predictions change it: substituting a character, turning a special character into a real one to insert text, or turning a character into a special one to delete it),
  • and removing the special characters from the output (see the sketch below).
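
To make this concrete, here is a minimal, self-contained sketch of the expand and collapse steps; the BLANK symbol, the helper names, and the example edits are illustrative assumptions, not the model's actual tokenizer code (that code ships with the model repo):

BLANK = '␣'  # stands for the special character inserted between input characters

def expand(text, spaces=2):
    # insert `spaces` special characters before, between, and after the input characters
    out = [BLANK] * spaces
    for ch in text:
        out.append(ch)
        out.extend([BLANK] * spaces)
    return out

def collapse(tokens):
    # drop the special characters to obtain the output text
    return ''.join(t for t in tokens if t != BLANK)

tokens = expand('Э Ҡаратау ҙы белмәйем.')
print(collapse(tokens))  # round-trips to the input when no position is changed
# A perfect model would edit a few positions before collapsing:
#   substitution: 'Э' -> 'Ә' (wrong letter fixed)
#   deletion:     the spurious ' ' before 'ҙы' -> BLANK
# so that collapse() yields 'Ә Ҡаратауҙы белмәйем.'

The inserted special characters are what allow the output to be longer than the input: a prediction can turn a BLANK into a real character, which inserts text.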

It was trained with CTC loss on a parallel corpus of (corrupted, fixed) sentence pairs. On our test dataset, it reduces the number of OCR errors by 41%.
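
For intuition about the objective, here is a hedged toy sketch using torch.nn.CTCLoss; the encoder, vocabulary, and data handling are placeholders, not the project's actual training code (linked below):

import torch
import torch.nn as nn

# Toy character vocabulary; index 0 is the CTC blank.
vocab = ['<blank>'] + sorted(set('Ә Ҡаратуҙыбелмәйэ.Э'))
char2id = {c: i for i, c in enumerate(vocab)}

class ToyEncoder(nn.Module):
    # stand-in for the tiny BERT that scores every (character or blank) position
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, vocab_size)

    def forward(self, ids):
        return self.proj(self.emb(ids))  # (batch, time, vocab)

def with_blanks(text, spaces=2):
    # corrupted input, expanded with blank positions so the output may grow
    ids = [0] * spaces
    for ch in text:
        ids.append(char2id[ch])
        ids += [0] * spaces
    return ids

corrupted, fixed = 'Э Ҡаратау ҙы белмәйем.', 'Ә Ҡаратауҙы белмәйем.'
x = torch.tensor([with_blanks(corrupted)])       # (1, T)
y = torch.tensor([[char2id[c] for c in fixed]])  # (1, S), no blanks

model = ToyEncoder(len(vocab))
log_probs = model(x).log_softmax(-1).transpose(0, 1)  # CTCLoss wants (T, batch, vocab)
loss = nn.CTCLoss(blank=0)(
    log_probs, y,
    input_lengths=torch.tensor([x.shape[1]]),
    target_lengths=torch.tensor([y.shape[1]]),
)
loss.backward()  # optimize as usual

Because CTC marginalizes over all alignments between input and target, the fixed sentence may be shorter or longer than the corrupted one, which is why the tokenizer inserts the extra blank positions in the first place.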

Training code: here. Training details: in this post (in Russian).

Model size: 2.8M parameters (safetensors; tensor types I64 and F32).