---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# DALL·E Video Clip Maker

This python project uses DALL·E Mini, through Replicate's API, to generate a photo-montage video from a song.

Given a YouTube url, the program will extract the audio and transcript of the video and use the lyrics in the transcript as text prompts for DALL·E Mini.

## Usage

The project can be accessed through a single command:

`python3 main.py <youtube url> --token <your replicate API token>`

An example output for "Here Comes the Sun" by The Beatles:

<img src="misc/frame-432.png" width="250"/> <img src="misc/frame-177.png" width="250"/>
<img src="misc/frame-316.png" width="250"/> <img src="misc/frame-633.png" width="250"/>
<img src="misc/frame-1724.png" width="250"/> <img src="misc/frame-1328.png" width="250"/>

Note that the project only works with YouTube videos that have a transcript.
# Blog Post

## 1. Interacting with the Replicate API to run DALL·E Mini

[Replicate](https://replicate.com) is a service to run open-source machine learning models from the cloud. The Replicate API enables you to use all Replicate models inside a python script, which is the core of this project.

All of the machinery is wrapped in the `DalleImageGenerator` class in `dall_e.py`, which handles all the interaction with Replicate.

Let's have a look at the code it runs in order to generate images from text.

To create an API object and specify the model we'd like to use, we first need an API token, which is available [here](https://replicate.com/docs/api) after subscribing to Replicate.

```
import os
import replicate

os.environ["REPLICATE_API_TOKEN"] = <Your Api access token from Replicate>
dalle = replicate.models.get("kuprel/min-dalle")
urls = dalle.predict(text=<your prompt>,
                     grid_size=<how many images to generate>,
                     log2_supercondition_factor=<a parameter controlling the output's relevance to the text>)
```

In this case, the model returns a list of urls to all intermediate images generated by DALL·E Mini. We want the final output, so we call

```
get_image(list(urls)[-1])
```

to download the last one using python's urllib library.
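A minimal sketch of such a download helper, assuming it simply fetches the url and writes the bytes to disk (the actual `get_image` in the repo may have a different signature):

```python
import urllib.request

def get_image(url, out_path="frame.png"):
    # Hypothetical sketch: fetch the image at `url` and save it to `out_path`.
    urllib.request.urlretrieve(url, out_path)
    return out_path
```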

## 2. Downloading content from YouTube

All the code in this section appears in **download_from_youtube.py**.

### Downloading the transcript

There is a very cool python package called **YouTubeTranscriptApi** and, as its name implies, it's going to be very useful.

The **YouTubeTranscriptApi.get_transcript** function needs a YouTube video ID, so we'll first extract it from the video url using urllib; the **get_video_id** function in the file does exactly that.
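A plausible sketch of such an extractor, built on `urllib.parse` (the repo's own **get_video_id** may differ; the video ID and urls below are illustrative):

```python
from urllib.parse import urlparse, parse_qs

def get_video_id(url):
    # Standard watch urls keep the ID in the `v` query parameter;
    # short youtu.be links keep it in the path.
    parsed = urlparse(url)
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/")
    return parse_qs(parsed.query)["v"][0]

print(get_video_id("https://www.youtube.com/watch?v=abc123xyz"))  # abc123xyz
```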
The main lines of code to get the transcript are:

```
from youtube_transcript_api import YouTubeTranscriptApi

video_id = get_video_id(url)
transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['en'])
```

Each entry of `transcript` is a python dictionary with keys 'text', 'start' and 'duration', giving a line of the lyrics, its starting time and its duration.
### Downloading the audio

I used a library called **youtube_dl** that can download an .mp3 file with the sound of a YouTube video.

The usage is fairly simple and is wrapped in the **download_mp3** function in the file:

```
import youtube_dl

ydl_opts = {
    'outtmpl': <specify output file path>,
    'format': 'bestaudio/best',
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'mp3',
        'preferredquality': '192',
    }],
}
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download([url])
```
## 3. Making a video clip

The rest of the code is conceptually simple. Using the transcript lines as prompts for DALL·E Mini, we get images and combine them with the .mp3 into a video clip.

In practice, there are some things to pay attention to in order to make the timing of the lyrics, sound and visuals play together.

Let's go through the code.

We loop over the transcript entries we previously downloaded:

```
for (text, start, end) in transcript:
```

Given the duration of the current line and an input argument args.sec_per_img, we calculate how many images we need. Also, DALL·E Mini generates a square grid of images, so if we want N images, we need to tell it to generate a grid of dimension √N. The calculation is:

```
grid_size = max(get_sqrt(duration / args.sec_per_img), 1)
```
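The `get_sqrt` helper isn't shown here; assuming it rounds the square root up so the grid yields at least the requested number of images, a hypothetical sketch is:

```python
import math

def get_sqrt(n):
    # Hypothetical sketch of the helper above: the smallest integer grid
    # side whose square grid holds at least `n` images.
    return math.ceil(math.sqrt(n))

print(get_sqrt(6))  # 3: a 3x3 grid is the smallest square grid with >= 6 images
```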

Now we ask Replicate for images from DALL·E Mini:

```
images = dalle.generate_images(text, grid_size, text_adherence=3)
```

If we want to generate a movie clip at a specific fps (a higher fps means more accurate timing, because we can change images more frequently), we usually need to write each image for multiple frames. The calculation I did is:

```
frames_per_image = int(duration * args.fps) // len(images)
```
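As a quick sanity check of the formula, with illustrative numbers (not taken from the repo):

```python
# An 8-second transcript line rendered at 24 fps with 4 generated images:
duration, fps, n_images = 8.0, 24, 4
frames_per_image = int(duration * fps) // n_images
print(frames_per_image)  # 48: each image is held for 48 consecutive frames
```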

Now, we use the **opencv** package to write the lyrics as subtitles on each frame:

```
frame = cv2.cvtColor(images[j], cv2.COLOR_RGBA2BGR)
frame = put_subtitles_on_frame(frame, text, resize_factor)
frames.append(frame)
```

where **put_subtitles_on_frame** is a function in utils.py that makes use of the **cv2.putText** function.

Finally, we can write all the aggregated frames into a file:

```
video = cv2.VideoWriter(vid_path, 0, args.fps, (img_dim, img_dim))
for frame in frames:
    video.write(frame)
cv2.destroyAllWindows()
video.release()
```

The code itself is in the **get_frames** function in **main.py** and is a little more elaborate. It also fills the parts of the song where there are no lyrics with images prompted by the last sentence or the song's name.

## 4. Sound and video mixing

Now that we have a video, we only need to mix it with the downloaded .mp3 file.

We'll use FFMPEG for this, with shell commands executed from python.

The first of the two commands below cuts the mp3 file to fit the length of the generated video in cases where the lyrics don't cover the whole song. The second command mixes the two into a new file with both video and song:

```
os.system(f"ffmpeg -ss 00:00:00 -t {video_duration} -i '{mp3_path}' -map 0:a -acodec libmp3lame '{f'data/{args.song_name}/tmp.mp3'}'")
os.system(f"ffmpeg -i '{vid_path}' -i '{f'data/{args.song_name}/tmp.mp3'}' -map 0 -map 1:a -c:v copy -shortest '{final_vid_path}'")
```
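The `video_duration` value passed to ffmpeg's `-t` flag above can be derived from the frame count and fps; one way to format it, with illustrative numbers (this is a sketch, not the repo's exact code):

```python
import datetime

# Illustrative: 4320 frames at 24 fps -> a 180-second video.
n_frames, fps = 4320, 24
video_duration = str(datetime.timedelta(seconds=n_frames // fps))
print(video_duration)  # 0:03:00, a form ffmpeg's -t flag accepts
```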

# TODO

- [ ] Fix missing-whitespace problems in subtitles
- [ ] Allow working with raw .mp3 and .srt files instead of URLs only
- [ ] Support automatically generated YouTube transcripts
- [ ] Better timing of subtitles and sound
- [ ] Find a way to upload videos without copyright infringement
- [ ] Use other text-to-image models from Replicate