|
--- |
|
license: mit |
|
library_name: transformers |
|
tags: |
|
- camera level |
|
- camera feature |
|
- movie analysis |
|
metrics: |
|
- accuracy |
|
- f1 |
|
pipeline_tag: image-classification |
|
--- |
|
|
|
# ConvNeXt V2 fine-tuned for camera level classification
|
|
|
A ConvNeXt V2 base-size model fine-tuned for camera level classification. The [Cinescale](https://cinescale.github.io/camera_al/#dataset) dataset is used to fine-tune the model for 20 epochs.
|
|
|
The model classifies an image into one of six classes: *aerial, eye, ground, hip, knee, shoulder*.
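
For context, below is a minimal sketch of what such a fine-tuning run can look like with the `transformers` Trainer. It is not the original training script: the base checkpoint name, the image-folder layout of the Cinescale frames, and all hyperparameters other than the 20 epochs are assumptions.

```python
# Hedged fine-tuning sketch, not the original training script.
# Assumptions: base checkpoint "facebook/convnextv2-base-22k-384" and frames
# stored as cinescale_level/train/<class_name>/*.jpg for the six classes.
import torch
from datasets import load_dataset
from transformers import (AutoImageProcessor, AutoModelForImageClassification,
                          Trainer, TrainingArguments)

base_ckpt = "facebook/convnextv2-base-22k-384"   # assumed base checkpoint
processor = AutoImageProcessor.from_pretrained(base_ckpt)

dataset = load_dataset("imagefolder", data_dir="cinescale_level")
labels = dataset["train"].features["label"].names  # the six camera-level classes

model = AutoModelForImageClassification.from_pretrained(
    base_ckpt,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
    ignore_mismatched_sizes=True,  # swap the ImageNet-22k head for a 6-class head
)

def apply_processor(batch):
    # The processor handles resizing to 384x384 and ImageNet normalization
    batch["pixel_values"] = [
        processor(img.convert("RGB"), return_tensors="pt")["pixel_values"][0]
        for img in batch["image"]
    ]
    return batch

dataset = dataset.with_transform(apply_processor)

def collate_fn(examples):
    return {
        "pixel_values": torch.stack([ex["pixel_values"] for ex in examples]),
        "labels": torch.tensor([ex["label"] for ex in examples]),
    }

args = TrainingArguments(
    output_dir="convnextv2-cinescale-level",
    num_train_epochs=20,             # as stated above
    per_device_train_batch_size=16,  # assumed
    learning_rate=5e-5,              # assumed
    remove_unused_columns=False,     # keep the raw "image" column for the transform
)

trainer = Trainer(model=model, args=args, train_dataset=dataset["train"],
                  data_collator=collate_fn)
trainer.train()
```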
|
|
|
|
|
## Evaluation |
|
|
|
On the test set (`test.csv`), the model achieves an accuracy of 89.82% and a macro-F1 of 82.31%.
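
A hedged sketch of how these numbers can be reproduced is shown below; the column names in `test.csv` (an image path column `path` and a string label column `label`) are assumptions.

```python
# Evaluation sketch; test.csv column names ("path", "label") are assumed.
import pandas as pd
import torch
from sklearn.metrics import accuracy_score, f1_score
from torchvision.io import ImageReadMode, read_image
from torchvision.transforms import v2
from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained(
    "gullalc/convnextv2-base-22k-384-cinescale-level"
)
label2id = model.config.label2id

transform = v2.Compose([
    v2.Resize((384, 384), antialias=True),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

df = pd.read_csv("test.csv")
preds, targets = [], []

with torch.no_grad():
    for path, label in zip(df["path"], df["label"]):
        pixels = transform(read_image(path, mode=ImageReadMode.RGB)).unsqueeze(0)
        logits = model(pixel_values=pixels).logits
        preds.append(logits.argmax(-1).item())
        targets.append(label2id[label])

print(f"accuracy: {accuracy_score(targets, preds):.4f}")
print(f"macro-F1: {f1_score(targets, preds, average='macro'):.4f}")
```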
|
|
|
|
|
## How to use |
|
|
|
```python |
|
import torch
from torchvision.io import ImageReadMode, read_image
from torchvision.transforms import v2
from transformers import AutoModelForImageClassification

# Load the fine-tuned checkpoint; the model expects 384x384 inputs
model = AutoModelForImageClassification.from_pretrained("gullalc/convnextv2-base-22k-384-cinescale-level")
im_size = 384

# Demo frame: https://www.pexels.com/photo/aerial-view-of-city-buildings-8783146/
image = read_image("demo/level_demo.jpg", mode=ImageReadMode.RGB)

# Resize, scale to [0, 1] and normalize with ImageNet statistics
transform = v2.Compose([
    v2.Resize((im_size, im_size), antialias=True),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

inputs = transform(image).unsqueeze(0)  # add a batch dimension -> (1, 3, 384, 384)

with torch.no_grad():
    outputs = model(pixel_values=inputs)

# Map the highest-scoring logit back to its class name
predicted_label = model.config.id2label[torch.argmax(outputs.logits).item()]
print(predicted_label)
# --> aerial
|
``` |
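
Alternatively, the high-level `pipeline` API can be used, assuming the repository ships an image processor config; its preprocessing may differ slightly from the torchvision transform above.

```python
from transformers import pipeline

# Assumes a preprocessor config is available in the repo; preprocessing may
# differ slightly from the manual torchvision transform above.
classifier = pipeline("image-classification",
                      model="gullalc/convnextv2-base-22k-384-cinescale-level")
print(classifier("demo/level_demo.jpg"))  # list of {label, score} dicts
```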