Dataset usage
Fascinating idea! Which dataset did you use for fine-tuning?
I used a video dataset that I collected myself. I'm trying to do a continuous sign language recognition task, but I got very bad accuracy. My dataset consists of skeleton videos extracted from the original videos. Any suggestions?
Is it words or sentences? I reckon the temporal dimension makes it tricky and requires a massive number of samples. How do you extract the skeletons? Do you model time simultaneously or separately?
It is sentences; each sentence has 3 to 5 words. I extract the skeletons with MediaPipe, redraw only the skeleton keypoints, and use that as the input video. I have 15 sentences, with between 24 and 38 videos per sentence. I followed the example code in "Video classification using Transformers" provided on the Hugging Face site.