GPT4o-Azure-Caption-Pixel-back

Runtime error

App Files Files Community

lalalalalalalalalala commited on Jun 18, 2024

Commit

9ea9a8c

verified ·

1 Parent(s): 44005dd

Update constraint.py

Browse files

Files changed (1) hide show

constraint.py +86 -1

constraint.py CHANGED Viewed

@@ -1,6 +1,91 @@
 SYS_PROMPT = ""
-USER_PROMPT = """Create a detailed and accurate video description, starting from a specific scene and possibly transitioning through various themes and settings. Begin by describing the initial scene in detail, including the environment, key objects, any characters or creatures and their actions, and the overall atmosphere, considering specific aspects such as shot sizes (extreme close-up, close-up, medium, full, wide, etc.), camera movements (push, pull, shake, pan, tilt, rise, descend, etc.), and more. For example, if the scene involves a person like a young man sitting on a chair reading a book, describe his appearance and the surrounding environment, including basic features such as the character's gender, age, race, etc., as well as actions, emotions, dialogues, and performance content. If the scene includes animals or natural elements such as cats, the sky, or landscapes, vividly describe these elements and their behaviors or states, and consider the emotions and thematic elements introduced in this opening scene. Then, as the video progresses, describe the evolving visual effects, how they present a more vivid and rich picture through camera movements and special effects, considering aesthetics (style, tone, color palette, atmosphere, emotions, etc.). If the scene changes, explain how it transitions, what new elements are introduced, whether the atmosphere remains consistent or changes, and how this affects the overall narrative or theme of the video. If the video contains multiple scenes, describe the connections between them, whether creating a story, presenting a contrast, or highlighting different aspects of a theme, considering scenes (day, night, indoor, outdoor, etc.), props (relationship with characters and scenes, relationship with camera and scheduling), and scene scheduling (single character, multiple characters with camera and narrative association, and how they relate to scene props). Finally, conclude with a summary that encapsulates the essence of the video, combining all the described elements into a cohesive narrative or message, emphasizing the sensory and emotional experience provided by the video, and speculating on the impact or message intended for the audience, allowing viewers to engage in profound reflection and insight during the viewing process, thus achieving a deeper impact. The generated description should adhere to English grammar and be no less than 120 words in length.
 """
 SKIP = 2

 SYS_PROMPT = ""
+USER_PROMPT = """# CONTEXT #
+You are a powerful video captioner.I want to tag 200,000 video files for use in training a text-to-video dataset. The purpose of the video tags is to train a text-to-video model. You need to provide a structured, detailed, and accurate description of the given video.
+# OBJECTIVE #
+Video Description Task Instructions
+Video Content Description:
+Detail and Accuracy: Provide a detailed and accurate description of the video content. Include all key objects, their types, colors, actions, positions, and relative positions. Describe the overall atmosphere.
+Persons and Animals: If there are people, describe their appearance and actions. If there are animals, describe their behavior to give a clear understanding of the scene.
+Multiple Scenes: If the video has multiple scenes, describe how they transition and highlight the differences between them.
+Objectivity: Do not include imagined content or overly subjective feelings. Ensure all descriptions are based on what can be confidently determined from the video.
+Grammar and Length: Use correct English grammar. Each descriptive sentence should be at least three sentences long.
+Video Quality Evaluation:
+Aesthetic Value: Evaluate the aesthetic value, including composition, color harmony, and overall visual effect. Score this aspect from 1 to 5 and explain your reasoning.
+Clarity: Assess the clarity, including resolution and detail presentation. Score this aspect from 1 to 5 and explain your reasoning.
+Emotional Impact: Evaluate the emotional impact, including how well the video conveys emotions and resonates with the audience. Score this aspect from 1 to 5 and explain your reasoning.
+Summary: Provide a summary of the scores for aesthetic value, clarity, and emotional impact.
+Film Perspective Analysis:
+Shot Analysis: Analyze the type of shots used (close-up, medium, long shot, etc.).
+Camera Movements: Describe the camera movements (push, pull, pan, tilt, track, crane, etc.).
+Composition: Analyze the composition of the shots.
+Interpretation: Provide your interpretation and feelings about the photographic work.
+# STYLE #
+cinematic language，such as narrative techniques, visual aesthetics, editing styles, and sound design.
+# Output Structure #
+Video Content:
+{Detailed description of the video here, meeting the above requirements}.
+Video Quality:
+{Evaluation score and explanation of the video quality here}.
+Film Perspective Description:
+{Analysis of the video from a film perspective here}.
+Example:
+Video Content:
+A stylish woman strides down a Tokyo street illuminated by warm neon lights and animated city signage. She sports a black leather jacket, a long red dress, black boots, and carries a black purse. Her look is completed with sunglasses and red lipstick. Her demeanor is confident and casual. The damp street reflects the vibrant lights, creating a mirror effect. The scene is bustling with numerous pedestrians.
+Video Quality:
+Aesthetic Value:
+- Composition and Color: The video showcases a well-balanced composition with harmonious color schemes, achieving a visually pleasing effect. Techniques such as symmetry and dynamic composition are skillfully employed.
+- Camera Work: The visual experience is enhanced by smooth transitions and diverse angles.
+- Score: 4/5
+Clarity:
+- Resolution: The video boasts high resolution with clear details.
+- Detail Presentation: It presents rich details with no noticeable blurriness or distortion.
+- Score: 5/5
+Emotional Impact:
+- Emotion Conveyance: The video successfully conveys joy and excitement, striking a chord with the audience.
+- Resonance: The compelling emotional expression, supported by well-integrated music and visuals, creates a strong impact.
+- Score: 4/5
+Summary:
+- Aesthetic Value: 4/5
+- Video Clarity: 5/5
+- Emotional Impact: 4/5
+Film Perspective Description:
+Characters:
+- Woman: A stylish woman dressed in a black leather jacket, long red dress, black boots, and carrying a black purse. She wears sunglasses and red lipstick.
+Scenes:
+- Tokyo Street: The street is filled with warm glowing neon lights and animated city signage, with damp reflective surfaces and numerous pedestrians.
+Shot 1:
+- The woman walks confidently and casually down the Tokyo street.
+- She heads towards the camera in a panoramic view with central composition. The camera is at eye level and follows her with a handheld shot.
+- Duration: 36 seconds
+Shot 2:
+- The woman continues her walk down the Tokyo street, maintaining her confident and casual demeanor.
+- She approaches the camera, with a close-up of her face, transitioning to a torso mid-shot. The camera remains at eye level, following her with a handheld shot.
+- Duration: 24 seconds
 """
 SKIP = 2