Aalaa committed on
Commit c3fded7 · 1 Parent(s): d301316

Update README.md

  1. README.md +2 -48
README.md CHANGED
@@ -3,12 +3,9 @@ license: apache-2.0
 tags:
 - vision
 - image-classification
-datasets:
-- imagenet-1k
-- imagenet-21k
 widget:
-- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
-  example_title: Tiger
+- src: https://www.estal.com/FitxersWeb/331958/estal_carroussel_wg_spirits_5.jpg
+  example_title: Glass
 - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/teapot.jpg
   example_title: Teapot
 - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg
@@ -64,46 +61,3 @@ print("Predicted class:", model.config.id2label[predicted_class_idx])
 ```
 
 For more code examples, we refer to the [documentation](https://huggingface.co/transformers/model_doc/vit.html#).
-
-## Training data
-
-The ViT model was pretrained on [ImageNet-21k](http://www.image-net.org/), a dataset consisting of 14 million images and 21k classes, and fine-tuned on [ImageNet](http://www.image-net.org/challenges/LSVRC/2012/), a dataset consisting of 1 million images and 1k classes.
-
-## Training procedure
-
-### Preprocessing
-
-The exact details of preprocessing of images during training/validation can be found [here](https://github.com/google-research/vision_transformer/blob/master/vit_jax/input_pipeline.py).
-
-Images are resized/rescaled to the same resolution (224x224) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
-
-### Pretraining
-
-The model was trained on TPUv3 hardware (8 cores). All model variants are trained with a batch size of 4096 and learning rate warmup of 10k steps. For ImageNet, the authors found it beneficial to additionally apply gradient clipping at global norm 1. Training resolution is 224.
-
-## Evaluation results
-
-For evaluation results on several image classification benchmarks, we refer to tables 2 and 5 of the original paper. Note that for fine-tuning, the best results are obtained with a higher resolution (384x384). Of course, increasing the model size will result in better performance.
-
-### BibTeX entry and citation info
-
-```bibtex
-@misc{wu2020visual,
-  title={Visual Transformers: Token-based Image Representation and Processing for Computer Vision},
-  author={Bichen Wu and Chenfeng Xu and Xiaoliang Dai and Alvin Wan and Peizhao Zhang and Zhicheng Yan and Masayoshi Tomizuka and Joseph Gonzalez and Kurt Keutzer and Peter Vajda},
-  year={2020},
-  eprint={2006.03677},
-  archivePrefix={arXiv},
-  primaryClass={cs.CV}
-}
-```
-
-```bibtex
-@inproceedings{deng2009imagenet,
-  title={Imagenet: A large-scale hierarchical image database},
-  author={Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li},
-  booktitle={2009 IEEE conference on computer vision and pattern recognition},
-  pages={248--255},
-  year={2009},
-  organization={Ieee}
-}
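
Since this commit drops the "Preprocessing" section from the README, the normalization it described (RGB channels normalized with mean 0.5 and standard deviation 0.5 at 224x224) may be worth keeping on record. Below is a minimal sketch of that step only, not the authors' input pipeline. It assumes the image has already been resized to 224x224 and that "rescaled" means dividing uint8 pixel values by 255; the function name `normalize_image` and the random test image are hypothetical.

```python
import numpy as np

# Per-channel statistics quoted in the removed README section.
MEAN = np.array([0.5, 0.5, 0.5], dtype=np.float32)
STD = np.array([0.5, 0.5, 0.5], dtype=np.float32)


def normalize_image(pixels: np.ndarray) -> np.ndarray:
    """Map uint8 RGB pixels of shape (H, W, 3) to float32 values in [-1, 1].

    Assumption: rescaling means dividing by 255 before normalizing.
    """
    if pixels.ndim != 3 or pixels.shape[-1] != 3:
        raise ValueError("expected an (H, W, 3) RGB array")
    scaled = pixels.astype(np.float32) / 255.0  # rescale to [0, 1]
    return (scaled - MEAN) / STD  # broadcast over the channel axis


# Hypothetical stand-in for an image already resized to 224x224.
image = np.random.default_rng(0).integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
normalized = normalize_image(image)
```

With mean and standard deviation both 0.5, the step reduces to mapping [0, 1] onto [-1, 1]. In practice the image processor that ships with the model checkpoint would normally handle resizing and normalization together.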