I want to use this model to identify actions, such as falls, that cannot be judged from a single image.
I have the same question. Any new ideas?
Hi, refer to V-BLIP for video captioning: https://huggingface.co/models?other=video-captioning
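Since an action like a fall only shows up across several frames, a video model generally needs a handful of frames sampled from the clip rather than one image. Below is a minimal sketch of picking evenly spaced frame indices. The helper name `sample_frame_indices` and its parameters are hypothetical, not part of any model's API, and how many frames a given model expects is something to check in its model card.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick evenly spaced frame indices from a clip (hypothetical helper)."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        # Clip is shorter than the requested sample count: use every frame.
        return list(range(total_frames))
    # Split the clip into equal segments and take the middle frame of each.
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


# Example: choose 4 frames from a 100-frame clip.
indices = sample_frame_indices(100, 4)
```

The selected frames could then be decoded with whatever video reader you already use and passed to the model together, so it sees the motion rather than a single pose.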