app.py
CHANGED
@@ -1286,71 +1286,11 @@ with gr.Blocks(title="Babel-ImageNet Quiz") as demo:
     # Title Area
     gr.Markdown(
         """
-        # Are you smarter🤓 than CLIP🤖?
-
-        <small>by Gregor Geigle, WüNLP & Computer Vision Lab, University of Würzburg</small>
-
-        In this quiz, you play against a CLIP model (specifically: [mSigLIP](https://huggingface.co/timm/ViT-B-16-SigLIP-i18n-256), a multilingual [SigLIP](https://arxiv.org/abs/2303.15343) model) and try to correctly classify the images over the 1000 ImageNet classes (in English) or over our (partial) Babel-ImageNet translations of those classes.
-        Select your language, click 'Start' and start guessing! We'll keep track of your score and of your opponent's.
-        > **Disclaimer:** Translations and images are derived automatically and can be wrong, unusual, or mismatch! This is supposed to be a fun game to explore the dataset and see how a CLIP model would answer the questions and not a product.
-        > We do *not* use the official ImageNet images. Instead, we use images linked in BabelNet for each class, which are often from Wikipedia and have not been checked for suitability.
-
-        > **Content Warning:** There are spiders, insects, and various animals under the images. Please take caution if those might scare you.
-
-        <details>
-        <summary> <b> FAQ</b> (click me to read)</summary>
-        <p><b>'Over 1000 classes? I just see 4.'</b> True, you have it easier and you only have to chose between 4 classes. These are the top-4 picks of your opponent (+ the correct class if they are wrong). Your opponent has it harder: they have to deal with all classes.</p>
-        <p><b>'Who is my opponent?'</b> Your opponent CLIP model is [mSigLIP](https://huggingface.co/timm/ViT-B-16-SigLIP-i18n-256), a powerful but small multilingual model with only 370M parameters.</p>
-        <p><b>'My game crashed/ I got an error!'</b> This usually happens because of problems with the image URLs. You can try the button to reroll the image or start a new round by clicking the 'Start' button again.</p>
-        </details>
+        # Are you smarter🤓 than CLIP🤖?
+        <small>adapted from the original code by Gregor Geigle, WüNLP & Computer Vision Lab, University of Würzburg</small>
         """
     )
-
-    with gr.Column(scale=1):
-        gr.Markdown(
-            """
-            <details>
-            <summary> <b>What is CLIP? </b> (click me to read)</summary>
-            <p>
-            <a href='https://arxiv.org/abs/2103.00020'>CLIP</a> are vision-language models that learn to encode images and text in a joint semantic embedding space, where related concepts are close together.
-            With CLIP, you can search through, filter, or group large image datasets. The image encoder in CLIP also powers many of the large vision language models like Llava 1.5!
-            </p>
-            <p>
-            Your opponent CLIP model [mSigLIP](https://arxiv.org/abs/2303.15343) in this quiz does 'zero-shot image classification': We encode all possible class labels and the image and we check which class is most similar; this is then the class chosen by CLIP.
-            </p>
-            </details>
-            """
-        )
-    with gr.Column(scale=1):
-        gr.Markdown(
-            """
-            <details>
-            <summary> <b>What is ImageNet? </b> (click me to read)</summary>
-            <p>
-            ImageNet is a challenging image classification dataset with 1000 diverse classes covering animals, plants, human-made objects and more.
-            It is a very popular dataset used to benchmark CLIP models because strong results here usually indicates that the image model is overall usefull for many tasks.
-            </p>
-            </details>
-            """
-        )
-    with gr.Column(scale=1):
-        gr.Markdown(
-            """
-            <details>
-            <summary> <b>What is Babel-ImageNet? </b> (click me to read)</summary>
-            <p>
-            ImageNet class labels are only in English but we want to use CLIP models also in other languages. How can we know how good a CLIP model is outside of English?
-            This is the goal of Babel-ImageNet: to translate the English labels to other languages. However, automatic translation can give bad results for many languages and human translation is expensive.
-            </p>
-            <p>
-            Instead, we use the fact that ImageNet was constructed using WordNet and WordNet in turn can be linked to the multilingual resource BabelNet.
-            Using this link, we can get reliable (partial) translations of the English labels.
-            For more details, please read our <a href='https://arxiv.org/abs/2306.08658'>paper.</a>
-            </p>
-            </details>
-            """
-        )
-    # language select dropdown
+
     with gr.Row():
         # language_select = gr.Dropdown(
         #     choices=main_language_values,
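The "What is CLIP?" text in the hunk above describes zero-shot image classification: encode every class label and the image into one embedding space, then pick the label most similar to the image. A minimal sketch of that decision rule, using made-up toy embeddings in place of real CLIP encoder outputs (the vectors and labels below are illustrative, not from the app):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def zero_shot_classify(image_emb, label_embs, labels):
    # CLIP-style zero-shot classification: embed every class label,
    # embed the image, and predict the label whose embedding is most
    # similar to the image embedding.
    sims = [cosine(image_emb, emb) for emb in label_embs]
    return labels[max(range(len(labels)), key=lambda i: sims[i])]

# Hypothetical embeddings standing in for real CLIP encoder outputs
labels = ["tarantula", "golden retriever", "acoustic guitar"]
label_embs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
image_emb = [0.05, 0.9, 0.3]  # nearest to the second label's embedding
prediction = zero_shot_classify(image_emb, label_embs, labels)
```

In the real app, the label and image embeddings would come from the mSigLIP text and image encoders; only the argmax-over-similarities step is shown here.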