Raghav Prabhakar committed on
Commit
e29e67c
·
1 Parent(s): 4bd3178

Update README.md

  1. README.md +114 -0
README.md CHANGED
@@ -1,3 +1,117 @@
1
  ---
2
  license: apache-2.0
3
+ tags:
4
+ - vision
5
+ datasets:
6
+ - imagenet-21k
7
  ---
8
+
9
+ # ImageGPT (small-sized model)
10
+
11
+ ImageGPT (iGPT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 32x32. It was introduced in the paper [Generative Pretraining from Pixels](https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf) by Chen et al. and first released in [this repository](https://github.com/openai/image-gpt). See also the official [blog post](https://openai.com/blog/image-gpt/).
12
+
13
+
14
+ ## Model description
15
+
16
+ ImageGPT (iGPT) is a GPT-like transformer decoder model pretrained in a self-supervised fashion on a large collection of images, namely ImageNet-21k, at a resolution of 32x32 pixels.
17
+
18
+ The goal for the model is simply to predict the next pixel value, given the previous ones.
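As a rough illustration of this objective (a minimal sketch, not the authors' training code), a flattened sequence of pixel tokens can be turned into (context, target) pairs, where each target is the token that follows its context. The helper name `next_pixel_pairs` and the toy token values are assumptions made for this example.

```python
def next_pixel_pairs(tokens):
    """Return (context, target) pairs for next-token prediction.

    `tokens` is a flattened, raster-ordered sequence of pixel tokens.
    In this sketch the first token has no preceding context, so it yields no pair.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Toy 2x2 "image" flattened in raster order (hypothetical token ids).
toy_tokens = [7, 3, 3, 9]
pairs = next_pixel_pairs(toy_tokens)
for context, target in pairs:
    print(context, "->", target)
```

A model trained on this objective learns to assign high probability to each target given its context.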
19
+
20
+ By pre-training the model, it learns an inner representation of images that can then be used to:
21
+ - extract features useful for downstream tasks: one can use ImageGPT to produce fixed image features and then train a linear model on them (such as a sklearn logistic regression model or SVM). This is also referred to as "linear probing".
22
+ - perform (un)conditional image generation.
23
+
24
+ ## Intended uses & limitations
25
+
26
+ You can use the raw model either as a feature extractor or for (un)conditional image generation.
27
+
28
+ ### How to use
29
+
30
+ Here is how to use this model as a feature extractor:
31
+
32
+ ```python
33
+ from transformers import AutoFeatureExtractor
34
+ from onnxruntime import InferenceSession
35
+ from datasets import load_dataset
36
+
37
+ # load image
38
+ dataset = load_dataset("huggingface/cats-image")
39
+ image = dataset["test"]["image"][0]
40
+
41
+ # load model
42
+ feature_extractor = AutoFeatureExtractor.from_pretrained("openai/imagegpt-small")
43
+ session = InferenceSession("model/model.onnx")
44
+
45
+ # ONNX Runtime expects NumPy arrays as input
46
+ inputs = feature_extractor(image, return_tensors="np")
47
+ outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))
48
+ ```
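To feed a linear probe, the per-position hidden states usually need to be reduced to one fixed-size vector per image. The sketch below averages over the sequence dimension. Mean pooling is an assumption here (this model card does not say which pooling to use), and `mean_pool` and the toy values are hypothetical stand-ins for one image's `last_hidden_state`.

```python
def mean_pool(hidden_states):
    """Average per-position vectors of shape (seq_len, hidden_size) into one vector."""
    seq_len = len(hidden_states)
    if seq_len == 0:
        raise ValueError("hidden_states must contain at least one position")
    # zip(*...) iterates over feature columns, so each column is averaged separately.
    return [sum(column) / seq_len for column in zip(*hidden_states)]


# Toy stand-in for one image's hidden states: 3 positions, 2 features each.
toy_hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
features = mean_pool(toy_hidden_states)
print(features)
```

The resulting vector can then serve as the input row for a linear classifier.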
49
+ Alternatively, you can use the model with a classification head, which returns logits:
50
+ ```python
51
+ from transformers import AutoFeatureExtractor
52
+ from onnxruntime import InferenceSession
53
+ from datasets import load_dataset
54
+
55
+ # load image
56
+ dataset = load_dataset("huggingface/cats-image")
57
+ image = dataset["test"]["image"][0]
58
+
59
+ # load model
60
+ feature_extractor = AutoFeatureExtractor.from_pretrained("openai/imagegpt-small")
61
+ session = InferenceSession("model/model_classification.onnx")
62
+
63
+ # ONNX Runtime expects NumPy arrays as input
64
+ inputs = feature_extractor(image, return_tensors="np")
65
+ outputs = session.run(output_names=["logits"], input_feed=dict(inputs))
66
+ ```
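Logits are unnormalized scores. A minimal sketch of turning them into probabilities and a predicted index is shown below. The `softmax` helper and the three toy logits are assumptions for illustration; the real output has one logit per class of the exported head.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a three-class problem.
toy_logits = [2.0, 1.0, 0.1]
probs = softmax(toy_logits)
# Index of the most probable class.
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, [round(p, 3) for p in probs])
```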
67
+ ## Original implementation
68
+
69
+ Follow [this link](https://huggingface.co/openai/imagegpt-small) to see the original implementation.
70
+
71
+ ## Training data
72
+
73
+ The ImageGPT model was pretrained on [ImageNet-21k](http://www.image-net.org/), a dataset consisting of 14 million images and 21k classes.
74
+
75
+ ## Training procedure
76
+
77
+ ### Preprocessing
78
+
79
+ Images are first resized/rescaled to the same resolution (32x32) and normalized across the RGB channels. Next, color clustering is performed: every RGB pixel is mapped to one of 512 possible cluster values. This way, one ends up with a sequence of 32x32 = 1024 cluster values, rather than 32x32x3 = 3072 channel values, which would be prohibitively long for Transformer-based models.
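The color-clustering step can be sketched as a nearest-centroid lookup. This is a simplified illustration under stated assumptions, not the released preprocessing code: the four-color `toy_palette` is hypothetical (the model uses 512 clusters), and `nearest_cluster` and `quantize` are names invented for this example.

```python
def nearest_cluster(pixel, palette):
    """Return the index of the palette colour closest to `pixel` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(palette)), key=lambda i: sq_dist(pixel, palette[i]))


def quantize(pixels, palette):
    """Map a flat list of RGB pixels to one cluster index per pixel."""
    return [nearest_cluster(p, palette) for p in pixels]


# Hypothetical 4-colour palette; the model described here uses 512 clusters.
toy_palette = [(0, 0, 0), (255, 0, 0), (0, 255, 0), (255, 255, 255)]
toy_pixels = [(10, 5, 0), (250, 10, 10), (240, 240, 250)]
tokens = quantize(toy_pixels, toy_palette)
print(tokens)

# Sequence-length arithmetic: one token per pixel versus one value per channel.
height = width = 32
print(height * width, height * width * 3)
```

Each pixel becomes a single token, which is why the sequence length drops by a factor of three.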
80
+
81
+ ### Pretraining
82
+
83
+ Training details can be found in section 3.4 of v2 of the paper.
84
+
85
+ ## Evaluation results
86
+
87
+ For evaluation results on several image classification benchmarks, we refer to the original paper.
88
+
89
+ ### BibTeX entry and citation info
90
+
91
+ ```bibtex
92
+ @InProceedings{pmlr-v119-chen20s,
93
+ title = {Generative Pretraining From Pixels},
94
+ author = {Chen, Mark and Radford, Alec and Child, Rewon and Wu, Jeffrey and Jun, Heewoo and Luan, David and Sutskever, Ilya},
95
+ booktitle = {Proceedings of the 37th International Conference on Machine Learning},
96
+ pages = {1691--1703},
97
+ year = {2020},
98
+ editor = {III, Hal Daumé and Singh, Aarti},
99
+ volume = {119},
100
+ series = {Proceedings of Machine Learning Research},
101
+ month = {13--18 Jul},
102
+ publisher = {PMLR},
103
+ pdf = {http://proceedings.mlr.press/v119/chen20s/chen20s.pdf},
104
+ url = {https://proceedings.mlr.press/v119/chen20s.html}
105
+ }
106
+ ```
107
+
108
+ ```bibtex
109
+ @inproceedings{deng2009imagenet,
110
+ title={Imagenet: A large-scale hierarchical image database},
111
+ author={Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li},
112
+ booktitle={2009 IEEE conference on computer vision and pattern recognition},
113
+ pages={248--255},
114
+ year={2009},
115
+ organization={IEEE}
116
+ }
117
+ ```