---
title: Mini-dalle Video-clip Maker
emoji: 🐃
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 4.19.2
app_file: app.py
pinned: false
license: mit
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# DALL·E Video Clip Maker

This Python project uses DALL·E Mini, through Replicate's API, to generate a photo-montage video from a song.
Given a YouTube URL, the program extracts the audio and transcript of the video and uses the lyrics in the transcript as text prompts for DALL·E Mini.
## Usage

The whole pipeline runs with a single command:

`python3 main.py <youtube url> --token <your replicate API token>`
Example outputs for "Here Comes the Sun" by The Beatles:

<img src="misc/frame-432.png" width="250"/> <img src="misc/frame-177.png" width="250"/>
<img src="misc/frame-316.png" width="250"/> <img src="misc/frame-633.png" width="250"/>
<img src="misc/frame-1724.png" width="250"/> <img src="misc/frame-1328.png" width="250"/>

Note that the project only works with YouTube videos that have a transcript.
# Blog Post

## 1. Interacting with the Replicate API to run DALL·E Mini

[Replicate](https://replicate.com) is a service for running open-source machine learning models in the cloud. The Replicate API lets you use any Replicate model from inside a Python script, which is the core of this project.
All of the machinery is wrapped in the `DalleImageGenerator` class in `dall_e.py`, which handles all interaction with Replicate.
Let's have a look at the code it runs in order to generate images from text.
To create an API object and specify the model we'd like to use, we first need an API token, which is available [here](https://replicate.com/docs/api) after signing up for Replicate.
```
import os
import replicate

# Authenticate with Replicate via an environment variable
os.environ["REPLICATE_API_TOKEN"] = "<your Replicate API token>"

# Get a handle to the DALL·E Mini model and run it
dalle = replicate.models.get("kuprel/min-dalle")
urls = dalle.predict(text="<your prompt>",
                     grid_size="<how many images to generate per grid side>",
                     log2_supercondition_factor="<controls how closely the output follows the text>")
```
In this case, the model returns a list of URLs to all the intermediate images generated by DALL·E Mini.
We want the final output, so we call

```
get_image(list(urls)[-1])
```

to download the last one using Python's urllib library.
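For illustration, a minimal `get_image` could look like the sketch below (assuming the image is decoded with PIL; the actual helper in `dall_e.py` may differ):

```
from io import BytesIO
from urllib.request import urlopen

from PIL import Image

def get_image(url):
    # Download the image bytes and decode them into a PIL image
    with urlopen(url) as response:
        return Image.open(BytesIO(response.read()))
```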
## 2. Downloading content from YouTube

All the code in this section appears in **download_from_youtube.py**.

### Downloading the transcript

There is a very cool Python package called **YouTubeTranscriptApi** and, as its name implies, it's going to be very useful.
The **YouTubeTranscriptApi.get_transcript** function needs a YouTube video ID, so we first extract it from the video URL using urllib; the function **get_video_id** in the file does exactly that (a minimal sketch follows the next snippet).
The main lines of code to get the transcript are:
```
id = get_video_id(url)
transcript = YouTubeTranscriptApi.get_transcript(id, languages=['en'])
```
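As for **get_video_id**, a minimal sketch (assuming standard `watch?v=` URLs; the real function may handle more URL formats) could be:

```
from urllib.parse import urlparse, parse_qs

def get_video_id(url):
    # e.g. "https://www.youtube.com/watch?v=abc123" -> "abc123"
    return parse_qs(urlparse(url).query)['v'][0]
```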
Each entry in **transcript** is a Python dictionary with keys 'text', 'start', and 'duration', giving a line of the lyrics, its starting time, and how long it lasts.
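For example, a single entry might look like this (illustrative values):

```
{'text': 'Here comes the sun', 'start': 12.4, 'duration': 3.1}
```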
### Downloading the audio

I used a library called **youtube_dl** that can download the sound of a YouTube video as an .mp3 file.
The usage is fairly simple and is wrapped in the **download_mp3** function in the file:
```
import youtube_dl

ydl_opts = {
    'outtmpl': <specify output file path>,
    'format': 'bestaudio/best',        # pick the best available audio stream
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',   # convert the download to audio via ffmpeg
        'preferredcodec': 'mp3',
        'preferredquality': '192',     # bitrate in kbps
    }],
}
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download([url])
```
## 3. Making a video clip

The rest of the code is conceptually simple: using the transcript lines as prompts for DALL·E Mini, we get images and combine them with the .mp3 into a video clip.
In practice, there are a few things to pay attention to so that the timing of the lyrics, sound, and visuals plays together.
Let's go through the code.
We loop over the transcript entries we previously downloaded:
```
for line in transcript:
    text, start, duration = line['text'], line['start'], line['duration']
```
Given the duration of the current line and the input argument **args.sec_per_img**, we calculate how many images we need.
Also, DALL·E Mini generates a square grid of images, so if we want N images, we need to ask for a grid of side length $\sqrt{N}$. The calculation is:
```
grid_size = max(get_sqrt(duration / args.sec_per_img), 1)
```
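Here **get_sqrt** is a project helper; one plausible implementation (an assumption, rounding up so the grid holds at least the requested number of images) is:

```
import math

def get_sqrt(n):
    # Round up: a grid of get_sqrt(n) x get_sqrt(n) holds at least n images
    return math.ceil(math.sqrt(n))
```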
Now we ask Replicate for images from DALL·E Mini:

```
images = dalle.generate_images(text, grid_size, text_adherence=3)
```
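Presumably, **generate_images** wraps the Replicate call from section 1; a rough sketch (the signature, the mapping of text_adherence onto log2_supercondition_factor, and the split_grid helper are all assumptions for illustration) might look like:

```
def generate_images(self, text, grid_size, text_adherence=3):
    # Ask DALL·E Mini for a grid_size x grid_size grid of images
    urls = self.dalle.predict(
        text=text,
        grid_size=grid_size,
        log2_supercondition_factor=text_adherence,
    )
    # Download the final grid and cut it into individual images
    # (split_grid is a hypothetical helper, not from the project)
    return split_grid(get_image(list(urls)[-1]), grid_size)
```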
If we want to generate the clip at a specific fps (a higher fps means more accurate timing, because we can switch images more frequently), we usually need to write each image to multiple consecutive frames.
The calculation I did is:

```
frames_per_image = int(duration * args.fps) // len(images)
```
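For example, a 6-second line rendered at 24 fps with 4 generated images gives int(6 * 24) // 4 = 36 frames per image.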
Now we use the **opencv** package to write the lyrics as subtitles on each frame:

```
frame = cv2.cvtColor(images[j], cv2.COLOR_RGBA2BGR)
frame = put_subtitles_on_frame(frame, text, resize_factor)
frames.append(frame)
```

where **put_subtitles_on_frame** is a function in utils.py that makes use of the **cv2.putText** function.
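A minimal version of such a function might look like this sketch (the centering, scale, and color choices are assumptions, not the project's exact implementation):

```
import cv2

def put_subtitles_on_frame(frame, text, resize_factor=1.0):
    font = cv2.FONT_HERSHEY_SIMPLEX
    scale = 0.6 * resize_factor
    thickness = 2
    (text_w, text_h), _ = cv2.getTextSize(text, font, scale, thickness)
    h, w = frame.shape[:2]
    x = max((w - text_w) // 2, 0)  # center the text horizontally
    y = h - 2 * text_h             # place it a little above the bottom edge
    cv2.putText(frame, text, (x, y), font, scale,
                (255, 255, 255), thickness, cv2.LINE_AA)
    return frame
```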
Finally, we can write all the aggregated frames into a file:

```
video = cv2.VideoWriter(vid_path, 0, args.fps, (img_dim, img_dim))
for frame in frames:
    video.write(frame)
video.release()
cv2.destroyAllWindows()
```
The code itself is in the **get_frames** function in **main.py** and is a little more elaborate. It also fills the parts of the song where there are no lyrics with images prompted by the last sentence or the song's name.
## 4. Sound and video mixing

Now that we have a video, we only need to mix it with the downloaded .mp3 file.
We'll use FFmpeg for this, with shell commands executed from Python.
The first of the two commands below trims the mp3 file to the length of the generated video, for cases where the lyrics don't cover the whole song. The second command muxes the two into a new file with both video and sound:
```
tmp_mp3 = f"data/{args.song_name}/tmp.mp3"
# Trim the audio to the video's length
os.system(f"ffmpeg -ss 00:00:00 -t {video_duration} -i '{mp3_path}' -map 0:a -acodec libmp3lame '{tmp_mp3}'")
# Mux the video stream with the trimmed audio
os.system(f"ffmpeg -i '{vid_path}' -i '{tmp_mp3}' -map 0 -map 1:a -c:v copy -shortest '{final_vid_path}'")
```
# TODO

- [ ] Fix missing-whitespace problems in subtitles
- [ ] Allow working on raw .mp3 and .srt files instead of URLs only
- [ ] Support automatically generated YouTube transcriptions
- [ ] Better timing of subtitles and sound
- [ ] Find a way to upload videos without copyright infringement
- [ ] Use other text-to-image models from Replicate