	# Small-E-Czech

	Small-E-Czech is an [ELECTRA](https://arxiv.org/abs/2003.10555)-small model pretrained on a Czech corpus created at Seznam.cz. Like other pretrained models, it should be fine-tuned on a downstream task before use.

	### How to use the discriminator in transformers
	```python
	from transformers import ElectraForPreTraining, ElectraTokenizerFast
	import torch

	discriminator = ElectraForPreTraining.from_pretrained("seznam/small-e-czech")
	tokenizer = ElectraTokenizerFast.from_pretrained(
	    "seznam/small-e-czech", strip_accents=False
	)

	sentence = "Za hory, za doly, mé zlaté parohy"
	fake_sentence = "Za hory, za doly, kočka zlaté parohy"

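	# Token strings, used only to label the columns of the printed output.
	fake_sentence_tokens = ["[CLS]"] + tokenizer.tokenize(fake_sentence) + ["[SEP]"]
	# encode() adds the [CLS] and [SEP] special tokens itself.
	fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")
	outputs = discriminator(fake_inputs)
	# outputs[0] holds one logit per token; the sigmoid maps it to a probability.
	predictions = torch.nn.Sigmoid()(outputs[0]).cpu().detach().numpy()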

	for token in fake_sentence_tokens:
	    print("{:>7s}".format(token), end="")
	print()

	for prediction in predictions.squeeze():
	    print("{:7.1f}".format(prediction), end="")
	print()
	```

	In the output we can see the probabilities of particular tokens not belonging in the sentence (i.e. having been faked by the generator) according to the discriminator:

	```
	  [CLS]     za   hory      ,     za    dol    ##y      ,  kočka  zlaté   paro   ##hy  [SEP]
	    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.8    0.3    0.2    0.1    0.0
	```
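
To turn these probabilities into a per-token decision, one option is to apply a cut-off. The following is a minimal sketch that reuses the tokens and values from the output above instead of running the model; the 0.5 threshold and the helper name `flag_replaced` are assumptions made for illustration, not part of the model or of transformers.

```python
# Tokens and per-token probabilities copied from the example output above.
tokens = ["[CLS]", "za", "hory", ",", "za", "dol", "##y", ",",
          "kočka", "zlaté", "paro", "##hy", "[SEP]"]
probabilities = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
                 0.8, 0.3, 0.2, 0.1, 0.0]

THRESHOLD = 0.5  # assumed cut-off; tune it for your own data


def flag_replaced(tokens, probabilities, threshold=THRESHOLD):
    """Return the tokens whose replacement probability exceeds the threshold."""
    return [tok for tok, p in zip(tokens, probabilities) if p > threshold]


flagged = flag_replaced(tokens, probabilities)
print(flagged)  # ['kočka']
```

With this cut-off only `kočka` is flagged, which matches the word that was swapped into the fake sentence. A lower threshold would also flag neighbouring tokens such as `zlaté`.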

	### Fine-tuning

	For instructions on how to fine-tune the model on a new task, see the official Hugging Face Transformers [tutorial](https://huggingface.co/transformers/training.html).
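
As a rough illustration of what such fine-tuning can look like, the sketch below runs a single gradient step for sequence classification. It assumes a text-classification task with two labels; the function name `finetune_step`, the learning rate, and the label count are illustrative choices, not recommendations from the model authors. Calling the function requires `torch`, `transformers`, and network access to download the checkpoint.

```python
def finetune_step(texts, labels, num_labels=2, lr=5e-5):
    """Run one gradient step of sequence-classification fine-tuning (sketch)."""
    # Imports are local so that defining the function downloads nothing.
    import torch
    from transformers import ElectraForSequenceClassification, ElectraTokenizerFast

    tokenizer = ElectraTokenizerFast.from_pretrained(
        "seznam/small-e-czech", strip_accents=False
    )
    # The encoder weights come from the checkpoint; the classification
    # head on top of it is newly initialised and has to be trained.
    model = ElectraForSequenceClassification.from_pretrained(
        "seznam/small-e-czech", num_labels=num_labels
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    # Pad and truncate so that the batch forms one rectangular tensor.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    model.train()
    # Passing labels makes the model compute the classification loss.
    outputs = model(**batch, labels=torch.tensor(labels))
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```

A call such as `finetune_step(["To je skvělé.", "To je hrozné."], [1, 0])` (made-up sentences and labels) returns the loss for that batch. A real training run would load the model once and loop over many batches and epochs, as the tutorial describes.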