---
license: mit
language:
  - en
pipeline_tag: zero-shot-image-classification
tags:
  - vision
  - simple
  - small
---

# tinyvvision 🧠✨

tinyvvision is a compact, synthetic curriculum-trained vision-language model designed to demonstrate real zero-shot capability in a minimal setup. Despite its small size (~630k parameters), it aligns images and captions effectively by learning shared visual-language embeddings.

## What tinyvvision can do

- Match simple geometric shapes (circles, stars, hearts, triangles, etc.) to descriptive captions (e.g., "a red circle", "a yellow star").
- Perform genuine zero-shot generalization: it can correctly match captions to shape and color combinations it never explicitly encountered during training.

## Model Details

- Type: Contrastive embedding (CLIP-style, zero-shot)
- Parameters: ~630,000 (tiny!)
- Training data: Fully synthetic: randomly generated shapes, letters, numbers, and symbols paired with descriptive text captions.
- Architecture:
  - Image encoder: Simple CNN
  - Text encoder: Small embedding layer + bidirectional GRU
- Embedding dimension: 128 (shared image-text embedding space)
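To make the two-tower layout concrete, here is a minimal PyTorch sketch of a model in this style. All layer sizes, channel counts, and module names below are illustrative assumptions; the actual tinyvvision checkpoint may differ. The only constraints taken from the card are the components (simple CNN image encoder, embedding + bidirectional GRU text encoder) and the shared 128-dimensional output space.

```python
# Hypothetical sketch of a tinyvvision-style two-tower model.
# Layer sizes are assumptions, not the released weights.
import torch
import torch.nn as nn

EMBED_DIM = 128  # shared image-text embedding dimension (from the model card)

class ImageEncoder(nn.Module):
    """Simple CNN that maps an RGB image to a 128-d embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, EMBED_DIM),
        )

    def forward(self, x):
        return self.net(x)

class TextEncoder(nn.Module):
    """Token embedding + bidirectional GRU, pooled to a 128-d embedding."""
    def __init__(self, vocab_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 64)
        # Bidirectional: each direction outputs EMBED_DIM // 2, concatenated.
        self.gru = nn.GRU(64, EMBED_DIM // 2, batch_first=True, bidirectional=True)

    def forward(self, tokens):
        hidden, _ = self.gru(self.embed(tokens))
        return hidden.mean(dim=1)  # average over the sequence dimension

# Both towers land in the same 128-d space, so a dot product compares them.
img = ImageEncoder()(torch.randn(2, 3, 64, 64))
txt = TextEncoder()(torch.randint(0, 256, (2, 12)))
print(img.shape, txt.shape)  # both towers emit (batch, 128)
```

During contrastive training, matching image/caption pairs are pulled together in this shared space and mismatched pairs pushed apart, which is what makes zero-shot matching by similarity possible at inference time.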

## Examples of Zero-Shot Matching

- Seen during training: "a red circle" → correctly matches the drawn red circle.
- Never seen: "a teal lightning bolt" → correctly matches a hand-drawn lightning bolt, a shape and color combination absent from the training data.

## Limitations

- tinyvvision is designed as a demonstration of zero-shot embedding and generalization on synthetic data. It is not trained on real-world data or complex scenarios. While robust within its domain (simple geometric shapes and clear captions), results may vary significantly on more complicated or out-of-domain inputs.

## How to Test tinyvvision

Check out the provided inference script to easily test your own shapes and captions. Feel free to challenge tinyvvision with new, unseen combinations to explore its generalization capability!
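The matching step itself is easy to sketch. The snippet below shows the CLIP-style inference logic with NumPy: embed each caption and the image into the shared space, then pick the caption with the highest cosine similarity. The `fake_encode` function is a deterministic placeholder standing in for the real encoders, so the embeddings here are illustrative, not actual model outputs.

```python
# CLIP-style zero-shot matching sketch. fake_encode is a placeholder for
# the real tinyvvision encoders; only the similarity logic is the point.
import numpy as np

def fake_encode(key, dim=128):
    # Deterministic pseudo-embedding derived from the input string,
    # L2-normalised like a real contrastive embedding would be.
    seed = sum((i + 1) * ord(c) for i, c in enumerate(key)) % (2 ** 32)
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

captions = ["a red circle", "a yellow star", "a teal lightning bolt"]
text_emb = np.stack([fake_encode(c) for c in captions])  # shape (3, 128)

# Pretend this embedding came from the image encoder run on a drawing.
image_emb = fake_encode("a teal lightning bolt")

# With unit vectors, the dot product is the cosine similarity.
scores = text_emb @ image_emb
best = captions[int(np.argmax(scores))]
print(best)
```

Swapping `fake_encode` for the model's real image and text encoders gives you the actual zero-shot classifier: the caption set acts as the label space, and you can change it freely at inference time.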

Enjoy experimenting!