Added Model card for the pose model using details from the paper.

#2
by gsalunke - opened
Files changed (1)
  1. README.md +165 -95
README.md CHANGED
@@ -5,195 +5,265 @@ tags: []
5
 
6
  # Model Card for Model ID
7
 
8
- <!-- Provide a quick summary of what the model is/does. -->
9
-
10
-
11
 
12
  ## Model Details
13
 
14
  ### Model Description
15
 
16
- <!-- Provide a longer summary of what this model is. -->
17
 
18
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
19
 
20
- - **Developed by:** [More Information Needed]
21
  - **Funded by [optional]:** [More Information Needed]
22
  - **Shared by [optional]:** [More Information Needed]
23
- - **Model type:** [More Information Needed]
24
- - **Language(s) (NLP):** [More Information Needed]
25
  - **License:** [More Information Needed]
26
- - **Finetuned from model [optional]:** [More Information Needed]
27
-
28
- ### Model Sources [optional]
29
 
30
- <!-- Provide the basic links for the model. -->
31
 
32
- - **Repository:** [More Information Needed]
33
- - **Paper [optional]:** [More Information Needed]
34
- - **Demo [optional]:** [More Information Needed]
35
 
36
  ## Uses
37
 
38
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
-
40
  ### Direct Use
41
 
42
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
 
44
- [More Information Needed]
45
 
46
- ### Downstream Use [optional]
47
 
48
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
 
50
- [More Information Needed]
51
 
52
- ### Out-of-Scope Use
53
 
54
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
 
56
- [More Information Needed]
57
 
58
- ## Bias, Risks, and Limitations
59
 
60
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
 
62
- [More Information Needed]
63
 
64
  ### Recommendations
65
 
66
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
-
68
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
 
70
  ## How to Get Started with the Model
71
 
72
- Use the code below to get started with the model.
73
-
74
- [More Information Needed]
75
 
76
  ## Training Details
77
 
78
  ### Training Data
 
79
 
80
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
 
82
- [More Information Needed]
83
 
84
- ### Training Procedure
85
 
86
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
 
88
- #### Preprocessing [optional]
89
 
90
- [More Information Needed]
91
 
 
92
 
93
- #### Training Hyperparameters
94
 
95
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
 
97
- #### Speeds, Sizes, Times [optional]
98
 
99
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
 
101
- [More Information Needed]
102
 
103
- ## Evaluation
104
 
105
- <!-- This section describes the evaluation protocols and provides the results. -->
106
 
107
- ### Testing Data, Factors & Metrics
108
 
109
- #### Testing Data
110
 
111
- <!-- This should link to a Dataset Card if possible. -->
112
 
113
- [More Information Needed]
114
 
115
- #### Factors
116
 
117
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
 
119
- [More Information Needed]
120
 
121
- #### Metrics
122
 
123
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
 
125
- [More Information Needed]
126
 
127
- ### Results
128
 
129
- [More Information Needed]
130
 
131
- #### Summary
132
 
133
 
 
134
 
135
- ## Model Examination [optional]
136
 
137
- <!-- Relevant interpretability work for the model goes here -->
138
 
139
- [More Information Needed]
140
 
141
- ## Environmental Impact
142
 
143
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
 
145
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
 
147
- - **Hardware Type:** [More Information Needed]
148
- - **Hours used:** [More Information Needed]
149
- - **Cloud Provider:** [More Information Needed]
150
- - **Compute Region:** [More Information Needed]
151
- - **Carbon Emitted:** [More Information Needed]
152
 
153
- ## Technical Specifications [optional]
154
 
155
- ### Model Architecture and Objective
156
 
157
- [More Information Needed]
158
 
159
- ### Compute Infrastructure
160
 
161
- [More Information Needed]
 
 
162
 
163
- #### Hardware
164
 
165
- [More Information Needed]
166
 
167
- #### Software
168
 
169
- [More Information Needed]
 
 
170
 
171
- ## Citation [optional]
172
 
173
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
 
175
- **BibTeX:**
176
 
177
- [More Information Needed]
178
 
179
- **APA:**
180
 
181
- [More Information Needed]
182
 
183
- ## Glossary [optional]
184
 
185
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
 
187
- [More Information Needed]
188
 
189
- ## More Information [optional]
190
 
191
- [More Information Needed]
192
 
193
- ## Model Card Authors [optional]
194
 
195
- [More Information Needed]
196
 
197
- ## Model Card Contact
198
 
199
- [More Information Needed]
5
 
6
  # Model Card for Model ID
7
 
8
+ Finetuned Vision Transformer (ViT-16) model for classifying the pose of figures in Mixtec codices.
 
 
9
 
10
  ## Model Details
11
 
12
  ### Model Description
13
 
14
+ This model classifies the pose of figures depicted in the Mixtec codices as standing or not standing. The codices depict historical and mythological scenes using structured pictorial representations. The model, a Vision Transformer (ViT-16), was finetuned on a custom-labeled dataset of 1,300 figures extracted from three historical Mixtec codices.
15
 
 
16
 
17
+ - **Developed by:** [ufdatastudio.com](https://ufdatastudio.com/)
18
  - **Funded by [optional]:** [More Information Needed]
19
  - **Shared by [optional]:** [More Information Needed]
20
+ - **Model type:** Image Classification
21
+ - **Language:** Python
22
  - **License:** [More Information Needed]
23
+ - **Finetuned from model [optional]:** Vision Transformer (ViT-16)
 
 
24
 
25
+ ### Model Sources
26
 
27
+ - **Repository:** https://github.com/ufdatastudio/mixteclabeling
28
+ - **Paper:** [Analyzing Finetuned Vision Models for Mixtec Codex Interpretation](https://ufdatastudio.com/papers/webber2024analyzing.pdf)
29
+ - **Poster:** https://ufdatastudio.com/papers/webber2024analyzing-poster.pdf
30
 
31
  ## Uses
32
 
 
 
33
  ### Direct Use
34
 
35
+ This model is intended for the classification of figures in historical Mixtec codices. The classification of pose assists in the interpretation of ancient Mixtec manuscripts, contributing to historical and anthropological research.
36
 
37
+ ### Downstream Use
38
 
39
+ This model may be used for more advanced tasks such as relationship extraction between figures within a codex scene, potentially helping to reconstruct the narratives depicted in the codices.
40
 
41
+ ### Out-of-Scope Use
42
 
43
+ Using the model for classification on datasets unrelated to Mixtec codices or datasets not following similar pictographic systems could yield inaccurate results. The model may not generalize well to modern or non-Mesoamerican artistic depictions.
44
 
45
+ ## Bias, Risks, and Limitations
46
 
47
+ + The model has adopted the use of pretrained classifiers, each trained on data not specific to our domain.
48
 
49
+ + The finetuned model inherits any biases encoded in the pretrained model. We have not investigated how these biases may affect downstream tasks.
50
 
51
+ + The finetuned models generated few errors in our investigation; however, we are unaware of how these biases may result in unintended effects.
52
 
53
+ + This work is an initial investigation into Mixtec and low-resource, semasiographic languages. We are prohibited from deeper explorations until we align our research direction with present communal, cultural, and anthropological needs. Support from Mixtec domain experts and native Mixtec speakers is essential for continued development.
54
 
 
55
 
56
  ### Recommendations
57
 
58
+ Given that the model can reliably classify figures from a low-resource dataset, this research opens the door for further processing and analysis of Mixtec Codices. The codices themselves are highly structured and carry a narrative woven through each scene. Finetuned state-of-the-art models could be combined to classify segmented figures within a scene, as well as classify the relationship between figures. These relationships would then be used to extract the narrative from a codex, as defined by domain experts.
 
 
59
 
60
  ## How to Get Started with the Model
61
 
62
+ ```python
+ import requests
+ import torch
+ from io import BytesIO
+ from PIL import Image
+ # ViTImageProcessor replaces the deprecated ViTFeatureExtractor
+ from transformers import ViTForImageClassification, ViTImageProcessor
+
+ # Load the image processor and model
+ model_name = "ufdatastudio/vit-pose"
+ image_processor = ViTImageProcessor.from_pretrained(model_name)
+ model = ViTForImageClassification.from_pretrained(model_name)
+
+ # Download the image; for a local file, pass its path to Image.open instead
+ response = requests.get("<link_to_image>", timeout=30)
+ response.raise_for_status()
+ img = Image.open(BytesIO(response.content)).convert("RGB")
+
+ # Preprocess the image
+ inputs = image_processor(images=img, return_tensors="pt")
+
+ # Run inference (classify the image)
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # Get the predicted class
+ predicted_class_idx = outputs.logits.argmax(-1).item()
+ labels = model.config.id2label  # maps class index to label name
+ predicted_label = labels[predicted_class_idx]
+
+ # Print the result
+ print(f"Predicted Label: {predicted_label}")
+ ```
92
 
93
  ## Training Details
94
 
95
  ### Training Data
96
+ The dataset used for the training of this model can be found at: https://huggingface.co/datasets/ufdatastudio/mixtec-figures
97
 
98
+ #### **Dataset Generation**
99
 
100
+ + Extracted labelled data from 3 Codices:
101
 
102
+ 1. **Vindobonensis Mexicanus (65 pages)**: Describes both the mythological and historical founding of the first Mixtec kingdoms.
103
 
104
+ 2. **Selden (20 pages)**: Follows the founding of the kingdom of Jaltepec and its ruler, Lady 6 Monkey.
105
 
106
+ 3. **Zouche-Nuttall (facsimile edition, 40 pages)**: Illustrates the life and conquests of Lord 8 Deer Jaguar Claw, but also details the histories of his ancestors.
107
 
108
+ > Note: Other Mixtec codices are extant, but their condition is degraded and not amenable to our current machine-learning pipeline. Each codex is made of deerskin folios, and each folio comprises two pages.
109
 
110
+ + **Extraction Method**: We used the [Segment Anything Model (SAM) from Facebook AI Research](https://segment-anything.com/) to extract individual figures from the three source codices.
111
 
112
+ + Each figure was annotated with the page on which it was found, its quality (a, b, or c), and its order within the page.
113
 
114
+ **a**. An a rating indicates that the entire figure is intact, regardless of minor blemishes or cracking, and can be classified by a human annotator as man or woman, standing or not.
115
 
116
+ **b**. A b rating means that while the previous characteristics of the figure can be determined, significant portions of the figure are missing or damaged.
117
 
118
+ **c**. A c rating means the figure is missing most of the definable characteristics humans could use to classify the sample.
119
 
120
+ + **Data Labelling**: After figure segmentation and grading, we added classification labels to each figure (standing/not standing).
121
 
122
+ + **Literature used for evaluation of figures**: Boone 2000; Smith, 1973; Jansen, 1988; Williams, 2013; Lopez, 2021.
123
 
124
+ + **Criteria used to determine standing and not standing**: If the figure is clearly on two feet and in an upright position, it is labeled standing, and any other position is labeled not standing.
125
 
126
+ + Two team members tagged the images for both categories independently and then verified the results with each other using the process of inter-rater reliability.
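The card does not say which agreement statistic the two annotators used. One common choice for two raters and a categorical label is Cohen's kappa; the sketch below assumes that choice, and the function name and sample labels are hypothetical rather than taken from the dataset.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations, for illustration only.
rater_a = ["standing", "standing", "not standing", "not standing", "standing", "not standing"]
rater_b = ["standing", "not standing", "not standing", "not standing", "standing", "not standing"]
kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.3f}")
```

A kappa near 1 indicates agreement well beyond chance; figures on which the raters disagree are the natural candidates for joint review.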
127
 
128
+ ### Training Procedure
129
 
130
+ #### **Preprocessing**
131
 
132
+ + Figures are converted to tensors and resized to 224x224 pixels.
133
 
134
+ + The loss function is biased by weighting each class by its inverse.
135
 
136
+ + Due to the overall limited number of figures, and to prevent overfitting, the entire dataset was augmented by using random flips and blocking to increase the number of samples for training.
137
 
138
+ + The dataset is split into training, testing, and validation sets, 60%, 20%, and 20% respectively.
139
 
140
+ + Eight reference images were set aside to monitor which features of pose are prevalent in activation and attention maps throughout training.
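As a rough illustration of two of the steps above, the sketch below splits 1,300 item indices 60/20/20 and derives per-class loss weights. It assumes that "inverse" means inverse class frequency, which the card does not state explicitly; the helper names, the seed, and the 800/500 class balance are hypothetical.

```python
import random
from collections import Counter


def split_indices(n_items, percents=(60, 20, 20), seed=0):
    """Shuffle item indices and cut them into train/test/validation lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = n_items * percents[0] // 100
    n_test = n_items * percents[1] // 100
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    val = indices[n_train + n_test:]  # the remainder goes to validation
    return train, test, val


def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / (len(counts) * count) for label, count in counts.items()}


train, test, val = split_indices(1300)

# Hypothetical class balance; the real counts are in the dataset itself.
labels = ["not standing"] * 800 + ["standing"] * 500
weights = inverse_frequency_weights(labels)
print(len(train), len(test), len(val), weights)
```

In PyTorch, weights like these would typically be passed as a tensor to the `weight` argument of `torch.nn.CrossEntropyLoss`, so that errors on the rarer class cost more.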
141
 
142
+ #### **Model Training**
143
 
144
+ + We fine-tuned the popular vision model ViT-16 to perform classification tasks and improve computational efficiency.
145
 
146
+ + Imported the model and its pre-trained weights from the PyTorch library, then unfroze the last four layers and heads of the model for training, as they are responsible for learning complex features specific to our classification tasks.
147
 
148
+ + Replaced the fully connected layer with one matching our binary classification task.
149
 
150
+ + Before the first and after the last epoch of training, an attention map is output for each reference image.
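The freeze-then-unfreeze pattern described above can be sketched in PyTorch as follows. This is not the authors' training code: `TinyViTStandIn` is a hypothetical, much smaller stand-in for a pretrained ViT, and its attribute names (`blocks`, `head`) are assumptions that will differ from those of a real ViT implementation.

```python
import torch
from torch import nn


class TinyViTStandIn(nn.Module):
    """Hypothetical stand-in: a stack of transformer blocks followed by a head."""

    def __init__(self, dim=32, depth=12, num_classes=1000):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, dim_feedforward=64, batch_first=True)
            for _ in range(depth)
        ])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):
        for block in self.blocks:
            tokens = block(tokens)
        return self.head(tokens[:, 0])  # classify from the first token


def prepare_for_finetuning(model, trainable_blocks=4, num_classes=2):
    """Freeze everything, unfreeze the last blocks, and swap in a new head."""
    for param in model.parameters():
        param.requires_grad = False
    for block in model.blocks[-trainable_blocks:]:
        for param in block.parameters():
            param.requires_grad = True
    # A freshly created layer is trainable by default.
    model.head = nn.Linear(model.head.in_features, num_classes)
    return model


model = prepare_for_finetuning(TinyViTStandIn())
logits = model(torch.randn(3, 5, 32))  # batch of 3, 5 tokens each, width 32
print(logits.shape)
```

Only parameters with `requires_grad=True` receive gradients, so an optimizer built from `filter(lambda p: p.requires_grad, model.parameters())` updates just the last four blocks and the new two-class head.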
151
 
152
 
153
+ ### **Hyperparameter Tuning**
154
 
155
+ + Experimented with different batch sizes, ranging from 32 to 128, and opted for an average value of 64 as no size significantly outperformed the others.
156
 
157
+ + Selected the loss function and optimizer according to the best practices associated with ViT.
158
 
159
+ + Hyperparameter investigations revealed that the accuracy for training and validation converged around 100 epochs and the ideal learning rate was 0.00025.
160
 
161
+ ### **Model Evaluation**
162
 
163
+ + For each training and validation run, we collected metrics such as accuracy, F1, recall, loss, and precision.
164
 
165
+ + The testing accuracy was around 98% with a standard deviation of 1%.
166
 
167
 
168
+ ### Testing Data, Factors & Metrics
169
 
170
+ #### Testing Data
171
 
172
+ The test set was 20% of the overall dataset, comprising 260 figures from all three codices.
173
 
 
174
 
175
+ #### Factors
176
+
177
+ In the dataset, images labelled 'not standing' outnumber images labelled 'standing'. The reason for this is unclear, although given the number of ceremonies that each codex describes, which entail a seated or kneeling position, this balance intuitively makes sense.
178
 
179
+ #### Metrics
180
 
181
+ The model’s performance was evaluated using accuracy, precision, recall, and F1 scores. Both models evaluated in the paper, ViT-16 and VGG-16, reached around 98% accuracy, with ViT-16 outperforming VGG-16 in some configurations.
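For readers less familiar with these metrics, the sketch below computes them for a binary task from paired true and predicted labels. Treating "standing" as the positive class is an assumption made for illustration, and the sample labels are invented, not model outputs.

```python
def binary_metrics(y_true, y_pred, positive="standing"):
    """Accuracy, precision, recall, and F1 for a binary labelling task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented labels, for illustration only.
y_true = ["standing", "standing", "not standing", "not standing", "not standing"]
y_pred = ["standing", "not standing", "not standing", "not standing", "standing"]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

Because 'not standing' is the larger class, accuracy alone can look good even when the minority class is handled poorly, which is why precision, recall, and F1 are reported alongside it.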
182
 
183
+ ### Results
184
 
185
+ The purpose of building the model was to answer the following questions:
186
+
187
+ 1. **Can transformer-based models be finetuned to classify figures from a Mixtec Codices dataset?**
188
 
189
+ Yes. The model achieved strong results across the training, validation, and testing phases when using an appropriate learning rate.
190
 
191
+ 2. **Does the model identify the same features experts do?**
192
 
193
+ + We assigned reference images for each class (man and woman, and standing/not standing) to understand which features each model learned, as well as to compare these learned features to those highlighted by experts.
194
 
195
+ + During training, we generated visualizations of activation and attention per pixel to view how the models learned important features over time.
196
 
197
+ <!-- ![Alt text](image-1.png) -->
198
 
199
+ + The ViT model assigned higher attention to areas corresponding to loincloths on men and showed increased attention to the poncho area on women.
200
 
201
+ + To verify that the model is indeed identifying the same features noted in literature, we masked attributes on the reference images.
202
 
203
+ + We extended our reference image set by adding three variations to each image: either blocked hair, blocked skirt, or both for woman. This process was replicated for the two features indicative of man.
204
 
205
+ + ViT correctly predicted 100% of the unblocked reference images, 79% of the singly blocked images, and 63% of the doubly blocked images.
206
 
207
+ + For the doubly blocked images, the model fails to find defined areas of attention. This verifies that the model is learning features defined in the literature.
208
 
 
209
 
210
+ #### Summary
211
 
212
+ We presented a low-resource dataset of figures from three Mixtec codices: Zouche-Nuttall, Selden, and Vindobonensis Mexicanus I. We extracted the figures using the Segment Anything Model and labeled them according to pose, a critical feature used to understand Mixtec codices. Using this novel dataset, we finetuned the last few layers of the transformer-based foundation model ViT-16 to classify figures as standing or not standing. We confirmed that the model is learning the features said to be relevant by experts using class activation maps and targeted blocking of said features.
213
 
214
+ ## Environmental Impact
215
 
216
+ We have not yet explored more environmentally efficient models. The environmental impact is the same as that of the Vision Transformer models.
217
+
218
+ ## Technical Specifications
219
+
220
+ ### Compute Infrastructure
221
+
222
+ #### Hardware
223
+
224
+ Model training and inference were performed on an Nvidia A100 on the HiPerGator cluster using PyTorch 2.1 and CUDA 11.
225
+
226
+ #### Software
227
+
228
+ PyTorch framework
229
+
230
+ ## Citation
231
+
232
+ **BibTeX:**
233
+ ```BibTeX
234
+ @inproceedings{webber-etal-2024-analyzing,
235
+ title = "Analyzing Finetuned Vision Models for {M}ixtec Codex Interpretation",
236
+ author = "Webber, Alexander and
237
+ Sayers, Zachary and
238
+ Wu, Amy and
239
+ Thorner, Elizabeth and
240
+ Witter, Justin and
241
+ Ayoubi, Gabriel and
242
+ Grant, Christan",
243
+ editor = "Mager, Manuel and
244
+ Ebrahimi, Abteen and
245
+ Rijhwani, Shruti and
246
+ Oncevay, Arturo and
247
+ Chiruzzo, Luis and
248
+ Pugh, Robert and
249
+ von der Wense, Katharina",
250
+ booktitle = "Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)",
251
+ month = jun,
252
+ year = "2024",
253
+ address = "Mexico City, Mexico",
254
+ publisher = "Association for Computational Linguistics",
255
+ url = "https://aclanthology.org/2024.americasnlp-1.6",
256
+ doi = "10.18653/v1/2024.americasnlp-1.6",
257
+ pages = "42--49",
258
+ abstract = "Throughout history, pictorial record-keeping has been used to document events, stories, and concepts. A popular example of this is the Tzolk{'}in Maya Calendar. The pre-Columbian Mixtec society also recorded many works through graphical media called codices that depict both stories and real events. Mixtec codices are unique because the depicted scenes are highly structured within and across documents. As a first effort toward translation, we created two binary classification tasks over Mixtec codices, namely, gender and pose. The composition of figures within a codex is essential for understanding the codex{'}s narrative. We labeled a dataset with around 1300 figures drawn from three codices of varying qualities. We finetuned the Visual Geometry Group 16 (VGG-16) and Vision Transformer 16 (ViT-16) models, measured their performance, and compared learned features with expert opinions found in literature. The results show that when finetuned, both VGG and ViT perform well, with the transformer-based architecture (ViT) outperforming the CNN-based architecture (VGG) at higher learning rates. We are releasing this work to allow collaboration with the Mixtec community and domain scientists.",
259
+ }
260
+ ```
261
+
262
+ ## Glossary
263
+
264
+ Figures: Representations of people or gods in Mixtec mythology, composed of different outfits, tools, and positions. Their names are represented by icons placed near their position on a page.
265
+
266
+ <!-- ## Model Card Authors [optional] -->
267
+
268
+ ## Model Card Contact
269
+ https://ufdatastudio.com/contact/