The fine-tuned ViT model that beats Google's state-of-the-art model and OpenAI's famous GPT-4 at identifying maps of cities around the world

A fine-tuned image-classification model that identifies which city's map is shown in an input image.

The Vision Transformer (ViT) base model is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Next, the model was fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, also at resolution 224x224.
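As a quick worked example of what the 224x224 resolution implies, the sketch below counts the tokens the encoder processes per image. The 16-pixel patch size is read off the checkpoint name vit-base-patch16-224 used later in this card; the three colour channels and the extra [CLS] token are standard ViT conventions assumed here rather than stated above.

```python
# Token arithmetic for a ViT with 16x16 patches at 224x224 resolution.
IMAGE_SIZE = 224   # input resolution stated in this card
PATCH_SIZE = 16    # assumed from "patch16" in the checkpoint name
CHANNELS = 3       # assumed RGB input

patches_per_side = IMAGE_SIZE // PATCH_SIZE
num_patches = patches_per_side ** 2
sequence_length = num_patches + 1          # patches plus one [CLS] token
values_per_patch = PATCH_SIZE * PATCH_SIZE * CHANNELS

print(patches_per_side, num_patches, sequence_length, values_per_patch)
```

Under these assumptions each map image becomes a 14 x 14 grid of patches, so the classifier works from 197 tokens in total.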

How to use:

Inference script

For more code examples, see the ViT documentation.
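The linked inference script is not reproduced here, so the following is only a minimal sketch of how inference could look with the Transformers image-classification pipeline. It assumes the repository id STEM-AI-mtl/City_map-vit-base-patch16-224 as shown on this page, and that transformers, torch and pillow are installed. The helper names best_label and classify_city_map and the file name city_map.png are hypothetical.

```python
# Minimal inference sketch. The heavy import is deferred so that the small
# helper below can be used without transformers installed.
MODEL_ID = "STEM-AI-mtl/City_map-vit-base-patch16-224"  # repo id assumed from this page


def best_label(predictions):
    """Return the (label, score) pair with the highest score.

    `predictions` is a list of {"label": str, "score": float} dicts, which is
    the format returned by the image-classification pipeline.
    """
    top = max(predictions, key=lambda p: p["score"])
    return top["label"], top["score"]


def classify_city_map(image_path, top_k=5):
    """Classify one map image (local path or URL).

    The model is downloaded from the Hub on first use.
    """
    from transformers import pipeline  # deferred heavy import

    classifier = pipeline("image-classification", model=MODEL_ID)
    return classifier(image_path, top_k=top_k)


# Example usage (hypothetical file name):
# predictions = classify_city_map("city_map.png")
# print(best_label(predictions))
```

The pipeline handles resizing and normalisation to the 224x224 input the model expects, so no manual preprocessing should be needed in this sketch.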

Training data

This city-identification model is Google's ViT-base-patch16-224 fine-tuned on the STEM-AI-mtl/City_map dataset, which contains over 600 images of 45 different city maps from around the world.

Training procedure

Transformer training was performed on google/vit-base-patch16-224 using a 4 GB Nvidia GTX 1650 GPU.
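The training notebook is not reproduced here. As a rough illustration only, preparing the base checkpoint for a 45-class fine-tune might look like the sketch below. It is not the authors' actual code: the helper names make_label_maps and load_model_for_finetuning are hypothetical, and the list of city names must come from the dataset.

```python
# Sketch of preparing the base checkpoint for fine-tuning on city maps.
BASE_CHECKPOINT = "google/vit-base-patch16-224"  # named in this card
NUM_CITIES = 45                                  # number of city maps stated in this card


def make_label_maps(city_names):
    """Build the id2label / label2id dictionaries stored in the model config."""
    id2label = {i: name for i, name in enumerate(city_names)}
    label2id = {name: i for i, name in id2label.items()}
    return id2label, label2id


def load_model_for_finetuning(city_names):
    """Load the base ViT and swap its 1,000-class head for a new one."""
    from transformers import AutoModelForImageClassification  # deferred heavy import

    id2label, label2id = make_label_maps(city_names)
    return AutoModelForImageClassification.from_pretrained(
        BASE_CHECKPOINT,
        num_labels=len(city_names),
        id2label=id2label,
        label2id=label2id,
        ignore_mismatched_sizes=True,  # pretrained head has 1,000 outputs, ours has fewer
    )
```

Passing ignore_mismatched_sizes=True lets the pretrained encoder weights load while the classification head is re-initialised for the new label set.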

Training notebook

Training evaluation results

The most accurate model was obtained with a learning rate of 1e-3. Training quality was evaluated on the training dataset, yielding the following metrics:

{'eval_loss': 1.3691096305847168,
'eval_accuracy': 0.6666666666666666,
'eval_runtime': 13.0277,
'eval_samples_per_second': 4.606,
'eval_steps_per_second': 0.154,
'epoch': 2.82}
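The reported throughput and runtime can be combined to estimate how many images the evaluation covered. The short calculation below is a back-of-the-envelope check; because the throughput figures are rounded, the results are estimates rather than numbers stated by the authors.

```python
# Back-of-the-envelope checks on the reported evaluation metrics.
metrics = {
    "eval_accuracy": 0.6666666666666666,
    "eval_runtime": 13.0277,
    "eval_samples_per_second": 4.606,
    "eval_steps_per_second": 0.154,
}

# samples = samples/second * seconds, and likewise for steps (batches).
num_samples = round(metrics["eval_samples_per_second"] * metrics["eval_runtime"])
num_steps = round(metrics["eval_steps_per_second"] * metrics["eval_runtime"])
num_correct = round(metrics["eval_accuracy"] * num_samples)

print(num_samples, num_steps, num_correct)
```

This suggests roughly 60 evaluation images processed in 2 batches, of which about 40 were classified correctly. With so few samples, the accuracy figure should be read as a coarse indicator.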

Model Card Authors

STEM.AI: [email protected]
William Harbec

Model size: 85.8M params (Safetensors, tensor type F32)

Dataset used to train STEM-AI-mtl/City_map-vit-base-patch16-224: STEM-AI-mtl/City_map