	---
	library_name: transformers
	license: mit
	datasets:
	- grascii/gregg-preanniversary-words
	pipeline_tag: image-to-text
	tags:
	- gregg
	- shorthand
	- stenography
	---

	# Gregg Vision v0.2.1

	Gregg Vision v0.2.1 generates a [Grascii](https://github.com/grascii/grascii) representation of a Gregg Shorthand form.

	- Model type: Vision Encoder Text Decoder
	- License: MIT
	- Repository: [GitHub](https://github.com/grascii/gregg-vision-v0.2.1)
	- Demo: [Grascii Search Space](https://huggingface.co/spaces/grascii/search)

	## Uses

	Given a grayscale image of a single shorthand form, Gregg Vision generates its
	Grascii representation. Passing that output to [Grascii Search](https://github.com/grascii/grascii)
	yields possible English interpretations of the shorthand form.
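
As a rough illustration of the second step, the sketch below hands a generated Grascii string to the Grascii command-line tool. It assumes the `grascii` package is installed and that its `search` subcommand accepts the string through `-g`, as shown in the Grascii README. The helper names `build_search_command` and `search_words` are hypothetical and not part of either project.

```python
import subprocess
from typing import List


def build_search_command(grascii: str) -> List[str]:
    """Build the argument list for a Grascii Search lookup.

    Assumption: the CLI is invoked as `grascii search -g <string>`.
    """
    cleaned = grascii.strip()
    if not cleaned:
        raise ValueError("expected a non-empty Grascii string")
    return ["grascii", "search", "-g", cleaned]


def search_words(grascii: str) -> str:
    """Run the search and return the CLI's raw text output.

    Requires the `grascii` executable on PATH (assumed, not checked here).
    """
    completed = subprocess.run(
        build_search_command(grascii),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


# Show the command that would be run for a model output such as "AB".
print(build_search_command("AB"))
```

Keeping the command construction separate from the subprocess call makes it easy to inspect what will be run before the CLI is installed.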

	## How to Get Started with the Model

	Use the code below to get started with the model.

	```python
	from transformers import AutoModelForVision2Seq, AutoImageProcessor, AutoTokenizer
	from PIL import Image
	import numpy as np


	model_id = "grascii/gregg-vision-v0.2.1"
	model = AutoModelForVision2Seq.from_pretrained(model_id)
	processor = AutoImageProcessor.from_pretrained(model_id)
	tokenizer = AutoTokenizer.from_pretrained(model_id)


	def generate_grascii(image: Image.Image) -> str:
	    """Return the Grascii string predicted for one shorthand form image."""
	    # convert image to a single channel
	    grayscale = image.convert("L")

	    # prepare processor input: a batch of one image
	    images = np.array([grayscale])

	    # preprocess image
	    pixel_values = processor(images, return_tensors="pt").pixel_values

	    # generate token ids
	    ids = model.generate(pixel_values, max_new_tokens=12)[0]

	    # decode ids and return grascii
	    return tokenizer.decode(ids, skip_special_tokens=True)
	```

	Note: As of `transformers` v4.47.0, the model is incompatible with `pipeline` because
	it expects single-channel image input.

	## Technical Details

	### Model Architecture and Objective

	Gregg Vision v0.2.1 is a transformer model with a ViT encoder and a RoBERTa decoder.

	For training, the model was warm-started using
	[vit-small-patch16-224-single-channel](https://huggingface.co/grascii/vit-small-patch16-224-single-channel)
	for the encoder and a randomly initialized RoBERTa network for the decoder.
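
To make the encoder's input geometry concrete, the short calculation below works out how many patches the encoder sees per image. It assumes the checkpoint name `vit-small-patch16-224-single-channel` follows the usual ViT naming convention (16×16 patches, 224×224 input, one channel). This card does not state those values, so treat them as an assumption. The helper `patch_geometry` is hypothetical.

```python
from typing import Tuple

# Assumed from the encoder name "vit-small-patch16-224-single-channel";
# the model card does not confirm these values.
IMAGE_SIZE = 224
PATCH_SIZE = 16
NUM_CHANNELS = 1


def patch_geometry(image_size: int, patch_size: int, num_channels: int) -> Tuple[int, int]:
    """Return (number of patches, raw pixel values per patch) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side * patches_per_side
    values_per_patch = patch_size * patch_size * num_channels
    return num_patches, values_per_patch


num_patches, values_per_patch = patch_geometry(IMAGE_SIZE, PATCH_SIZE, NUM_CHANNELS)
print(num_patches)       # 196
print(values_per_patch)  # 256
```

With three channels the same call would give 768 values per patch, so a single-channel encoder projects a third as many raw pixel values per patch as an RGB one.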

	### Training Data

	Gregg Vision v0.2.1 was trained on the [gregg-preanniversary-words](https://huggingface.co/datasets/grascii/gregg-preanniversary-words) dataset.

	### Training Hardware

	Gregg Vision v0.2.1 was trained on a single NVIDIA T4 GPU.