---
title: Actors matching
emoji: 🎬
colorFrom: yellow
colorTo: orange
sdk: gradio
app_file: app.py
pinned: true
---

# Actors matching demo

Who should play Hannibal (the Carthaginian, not the cannibal) if HBO ever adapts his story? How about you? Who should be your actor?
This application lets you input an image and see the top three actors who most closely resemble the person in it, based on facial features.

Try it out on my [Hugging Face Space](https://huggingface.co/spaces/nbeuchat/actors_matching)


## Data

The data comes from two sources:

1. I built a list of relevant actors who have been in popular movies across their careers. The datasets that I used to build it can be found on the [IMDB datasets page](https://datasets.imdbws.com/) (see instructions [here](https://www.imdb.com/interfaces/)).
2. I then found 20 images of each actor using the Microsoft Bing Search API, with queries such as *"Brad Pitt, actor or actress"*.

Note that due to API limits, I only took images from 1,000 actors. 

## Application

The application is built with Gradio and deployed on a Hugging Face Space. In the background, it:

1. Uses the [`face_recognition` library](https://github.com/ageitgey/face_recognition) to extract the location of faces in the image and compute an embedding of these faces
2. Uses Spotify's `annoy` library to efficiently search for the closest actors, based on the face embedding and a small database of actors' face embeddings
3. Shows you the best matches!
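
The steps above could be wired together roughly as follows. This is a minimal sketch, not the code in `app.py`: the function names (`embed_faces`, `build_index`, `top_matches`), the Euclidean metric, and the number of trees are assumptions made for illustration. The library calls themselves (`face_locations`, `face_encodings`, `AnnoyIndex`) are the public APIs of `face_recognition` and `annoy`.

```python
from typing import Dict, List, Sequence, Tuple

EMBEDDING_DIM = 128  # face_recognition returns 128-dimensional face encodings


def embed_faces(image_path: str) -> list:
    """Locate every face in the image and return one embedding per face."""
    import face_recognition  # imported lazily because it requires dlib

    image = face_recognition.load_image_file(image_path)
    locations = face_recognition.face_locations(image)
    return face_recognition.face_encodings(image, known_face_locations=locations)


def build_index(actor_embeddings: Dict[str, Sequence[float]], n_trees: int = 10):
    """Build an annoy index holding one embedding per actor (assumed layout)."""
    from annoy import AnnoyIndex

    index = AnnoyIndex(EMBEDDING_DIM, "euclidean")
    names = list(actor_embeddings)
    for item_id, name in enumerate(names):
        index.add_item(item_id, list(actor_embeddings[name]))
    index.build(n_trees)
    return index, names


def top_matches(index, names: List[str], face_embedding: Sequence[float],
                k: int = 3) -> List[Tuple[str, float]]:
    """Return the k closest actors as (name, distance) pairs, closest first."""
    ids, distances = index.get_nns_by_vector(
        list(face_embedding), k, include_distances=True
    )
    return [(names[i], d) for i, d in zip(ids, distances)]
```

Given a hypothetical `actor_embeddings` dictionary, a call would look like `index, names = build_index(actor_embeddings)` followed by `top_matches(index, names, embed_faces("me.jpg")[0])`. Note that `embed_faces` returns an empty list when no face is detected, so a real application would need to handle that case.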

This is meant to be a fun and tiny application. There are known issues and biases. 

## Known biases and limitations

There are a few issues with the dataset and models used:

- The dataset of actors is limited to a couple thousand actors and actresses, so it is not representative of the richness of professionals out there
- The subset of actors and actresses selected is based on an aggregate metric that considers all movies and shows in which the person was listed as an actor/actress. It is the sum of the number of IMDb votes for each movie/show, weighted by its average IMDb score. This is obviously only a rough indicator of popularity, but it provided me with a quick way of getting a dataset of actors that people may know.
- Given the above, the database sampling will have several biases that are intrinsic to (a) the IMDb database and user base itself, which is biased towards Western/American movies, and (b) the movie industry itself, with a dominance of white male actors
- The pictures of actors and actresses were collected through a simple Bing Search and not manually verified, so there are several mistakes. For example, Graham Greene has a mix of pictures from Graham Greene, the Canadian actor, and Graham Greene, the writer. You may get surprising results from time to time! Let me know if you find mistakes.
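
To make the selection metric above concrete, here is one possible reading of it: for each person, sum `numVotes * averageRating` over every title they are credited in. The column names `nconst`, `tconst`, `averageRating`, and `numVotes` come from the public IMDb datasets. The function name, the input shapes, and the exact formula are assumptions, since the description above does not pin them down.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def popularity_scores(
    credits: Iterable[Tuple[str, str]],
    ratings: Dict[str, Tuple[float, int]],
) -> Dict[str, float]:
    """Sum numVotes * averageRating over each person's credited titles.

    credits: (nconst, tconst) pairs, one per acting credit.
    ratings: tconst -> (averageRating, numVotes).
    """
    scores: Dict[str, float] = defaultdict(float)
    for person_id, title_id in credits:
        rating = ratings.get(title_id)
        if rating is None:
            continue  # unrated titles contribute nothing
        average_rating, num_votes = rating
        scores[person_id] += num_votes * average_rating
    return dict(scores)


# Toy data with made-up identifiers, for illustration only
credits = [("nm1", "tt1"), ("nm1", "tt2"), ("nm2", "tt2"), ("nm3", "tt9")]
ratings = {"tt1": (8.0, 1000), "tt2": (6.5, 200)}
scores = popularity_scores(credits, ratings)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Under this reading, a person with many credits in heavily voted titles scores high even if the ratings are middling, which is consistent with the bias toward well-known Western productions noted above.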

## Next steps

- Better image dataset (i.e., identify and clean up errors where multiple people were returned by the Bing Search)
- Larger and more balanced dataset (to reduce the bias toward white male actors)
- Provide a way of looping through multiple people in a picture in the Gradio app
- Currently, I find the best matching actor using the average embedding for the actor. I plan to then do a second pass to find the closest matching picture(s) of this specific actor for a better user experience. 
- Deeper analysis of which embedding dimensions are necessary. Might want to reweight them.
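
The planned two-pass matching could look something like the sketch below. It is only an illustration of the idea: the function names and data layout are hypothetical, and the first pass uses an exact brute-force search with `math.dist` where the app would use its `annoy` index.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def mean_embedding(vectors: List[Vector]) -> List[float]:
    """Component-wise average of a non-empty list of embeddings."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def two_pass_match(
    query: Vector, actor_pictures: Dict[str, List[Vector]]
) -> Tuple[str, int]:
    """Return the best actor and the index of their closest picture."""
    # Pass 1: closest actor by average embedding (brute force stand-in).
    averages = {
        name: mean_embedding(pictures)
        for name, pictures in actor_pictures.items()
    }
    best_actor = min(averages, key=lambda name: math.dist(query, averages[name]))

    # Pass 2: closest individual picture of that actor only.
    pictures = actor_pictures[best_actor]
    best_picture = min(
        range(len(pictures)), key=lambda i: math.dist(query, pictures[i])
    )
    return best_actor, best_picture


# Toy 2-dimensional embeddings, for illustration only
actor_pictures = {
    "A": [[0.0, 0.0], [2.0, 0.0]],
    "B": [[10.0, 10.0], [12.0, 10.0]],
}
match = two_pass_match([1.8, 0.1], actor_pictures)
```

Because the second pass only scans the pictures of a single actor (20 in the current dataset), it adds very little cost on top of the first search.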

## Credits

Author: Nicolas Beuchat ([email protected])

Thanks to the following open-source projects:

- [dlib](https://github.com/davisking/dlib) by [Davis King](https://github.com/davisking) ([@nulhom](https://twitter.com/nulhom))
- [face_recognition](https://github.com/ageitgey/face_recognition) by [Adam Geitgey](https://github.com/ageitgey)
- [annoy](https://github.com/spotify/annoy) by Spotify 

Example images used in the Gradio app (most under [Creative Commons Attribution license](https://en.wikipedia.org/wiki/en:Creative_Commons)):

- [RB Ginsburg](https://www.flickr.com/photos/tradlands/25602059686) - CC
- [Frederick Douglass](https://commons.wikimedia.org/wiki/File:Frederick_Douglass_1856_sq.jpg) - CC
- [Leonardo da Vinci](https://commons.wikimedia.org/wiki/File:Leonardo_da_Vinci._Photograph_by_E._Desmaisons_after_a_print_Wellcome_V0027541EL.jpg) - CC
- [Hannibal Barca](https://en.wikipedia.org/wiki/Hannibal#/media/File:Mommsen_p265.jpg) - Public domain
- [Joan of Arc](https://de.wikipedia.org/wiki/Jeanne_d%E2%80%99Arc#/media/Datei:Joan_of_Arc_miniature_graded.jpg) - Public domain