# ========== (c) JP Hwang 25/9/2022 ==========

import logging
import pandas as pd
import numpy as np
import streamlit as st
import plotly.express as px
from scipy import spatial
import random

# ===== SET UP LOGGER =====
logger = logging.getLogger(__name__)
root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
sh = logging.StreamHandler()
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
sh.setFormatter(formatter)
root_logger.addHandler(sh)
# ===== END LOGGER SETUP =====

desired_width = 320
pd.set_option('display.max_columns', 20)
pd.set_option('display.width', desired_width)

# Marker sizes: default points, search hits, and the selected base color
sizes = [1, 20, 30]


def get_top_tokens(ser_in):
    # Return the 10 most common underscore-separated tokens in a Series of names
    from collections import Counter
    tkn_list = '_'.join(ser_in.tolist()).split('_')
    tkn_counts = Counter(tkn_list)
    common_tokens = [i[0] for i in tkn_counts.most_common(10)]
    return common_tokens


def build_chart(df_in):
    # 3-D scatter of the colors, one axis per RGB channel.
    # Each point is drawn using its own RGB string; this relies on
    # simple_name identifying each row.
    fig = px.scatter_3d(df_in, x='r', y='g', z='b',
                        template='plotly_white',
                        color=df_in['simple_name'],
                        color_discrete_sequence=df_in['rgb'],
                        size='size',
                        hover_data=['name'])
    fig.update_layout(
        showlegend=False,
        margin=dict(l=5, r=5, t=20, b=5)
    )
    return fig


def preproc_data():
    df = pd.read_csv('data/colors.csv', names=['simple_name', 'name', 'hex', 'r', 'g', 'b'])
    # Preprocessing: build a CSS-style color string for plotting
    df['rgb'] = df.apply(lambda x: f'rgb({x.r}, {x.g}, {x.b})', axis=1)
    # Use the last token of simple_name as a rough 'basic' color category
    df = df.assign(category=df.simple_name.apply(lambda x: x.split('_')[-1]))
    # Set default size attribute
    df['size'] = sizes[0]
    return df


def get_top_colors(df):
    # The 15 most frequent categories, keeping only those that are also color names
    top_colors = df['category'].value_counts()[:15].index.tolist()
    top_colors = [c for c in top_colors if c in df.simple_name.values]
    return top_colors


def main():
    st.title('Colorful vectors')
    st.markdown("""
You might have heard that objects like
words or images can be represented by "vectors".
What does that mean, exactly? It seems like a tricky concept, but it doesn't have to be.
Let's start here, where colors are represented in 3-D space 🌈.
Each axis represents how much of one primary color (`red`, `green`, or `blue`)
a color contains.
For example, `Magenta` is represented by `(255, 0, 255)`,
and `(80, 200, 120)` represents `Emerald`.
That's all a *vector* is in this context - a sequence of numbers.
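
As a minimal sketch of that idea (the variable names are only for illustration),
each color can be stored as a NumPy array of three numbers:

```python
import numpy as np

# Each color is a 3-element vector: (red, green, blue), each from 0 to 255.
magenta = np.array([255, 0, 255])
emerald = np.array([80, 200, 120])

# The individual components are just positions along each axis.
red, green, blue = emerald
print(red, green, blue)
```
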
Take a look at the resulting 3-D image below; it's kind of mesmerising!
(You can spin the image around, as well as zoom in/out.)
    """
    )
    df = preproc_data()
    fig = build_chart(df)
    st.plotly_chart(fig)
    st.markdown("""
### Why does this matter?
You see here that similar colors are placed close to each other in space.
It seems obvious, but **this** is the crux of why a *vector representation* is so powerful.
Because these objects are located *in space* according to their key property (`color`),
their similarity can be assessed easily and objectively.
Let's take this further:
    """)
    # ===== SCALAR SEARCH =====
    st.header('Searching in vector space')
    st.markdown("""
Imagine that you need to identify colors similar to a given color.
You could do it by name, for instance by looking for colors whose names contain matching words.
But remember that in the 3-D chart above, similar colors are physically close to each other.
So all you actually need to do is calculate distances, and collect the points that fall within a threshold!
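
Here is a rough sketch of that idea using NumPy. The three sample colors,
the helper name `find_similar`, and the threshold of 80 are illustrative
choices, not values taken from this app's dataset:

```python
import numpy as np

# A tiny, hand-picked palette (illustrative values only).
names = ["red", "crimson", "navy"]
vectors = np.array([
    [255, 0, 0],
    [220, 20, 60],
    [0, 0, 128],
], dtype=float)

def find_similar(query, vectors, names, threshold):
    # Euclidean distance from the query to every color at once.
    dists = np.linalg.norm(vectors - query, axis=1)
    # Keep colors within the threshold, nearest first.
    order = np.argsort(dists)
    return [names[i] for i in order if dists[i] < threshold]

similar = find_similar(np.array([255.0, 0.0, 0.0]), vectors, names, threshold=80)
print(similar)
```

A smaller threshold keeps only the nearest neighbors; a larger one casts a wider net.
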
That's probably still a bit abstract - so pick a 'base' color, and we'll go from there.
In fact - try a few different colors while you're at it!
    """)
    top_colors = get_top_colors(df)
    # def_choice = random.randrange(len(top_colors))
    # Default to the sixth entry, falling back to the last one if the list is shorter
    query = st.selectbox('Pick a "base" color:', top_colors,
                         index=min(5, len(top_colors) - 1))
    match = df[df.simple_name == query].iloc[0]
    # Plain substring match on the name (regex=False treats the query literally)
    scalar_filter = df.simple_name.str.contains(query, regex=False)
    st.markdown(f"""
The color `{match.simple_name}` is also represented
in our 3-D space by `({match.r}, {match.g}, {match.b})`.
Let's see what we can find using either of these properties.
(Oh, you can adjust the similarity threshold below as well.)
    """)
    with st.expander("Similarity search options"):
        st.markdown(f"""
Do you want to find lots of similar colors, or
just a select few colors that are *very* similar to `{match.simple_name}`?
        """)
        thresh_sel = st.slider('Select a similarity threshold',
                               min_value=20, max_value=160,
                               value=80, step=20)
    st.markdown("---")
    # Reset sizes, then enlarge the name matches and the base color itself
    df['size'] = sizes[0]
    df.loc[scalar_filter, 'size'] = sizes[1]
    df.loc[df.simple_name == match.simple_name, 'size'] = sizes[2]
    scalar_fig = build_chart(df)
    scalar_hits = df[scalar_filter]['name'].values
    # ===== VECTOR SEARCH =====
    vector = match[['r', 'g', 'b']].values.tolist()
    dist_metric = 'euc'

    def get_dist(a, b, method):
        # Euclidean distance for 'euc'; cosine distance otherwise
        if method == 'euc':
            return np.linalg.norm(a-b)
        else:
            return spatial.distance.cosine(a, b)

    # Distance from every color to the base color's vector
    df['dist'] = df[['r', 'g', 'b']].apply(lambda x: get_dist(x, vector, dist_metric), axis=1)
    df['size'] = sizes[0]
    if dist_metric == 'euc':
        vec_filter = df['dist'] < thresh_sel
    else:
        vec_filter = df['dist'] < 0.05
    df.loc[vec_filter, 'size'] = sizes[1]
    # Enlarge the point(s) sitting exactly at the base color's coordinates
    df.loc[((df['r'] == vector[0]) &
            (df['g'] == vector[1]) &
            (df['b'] == vector[2])
            ),
           'size'] = sizes[2]
    vector_fig = build_chart(df)
    vector_hits = df[vec_filter].sort_values('dist')['name'].values
    # ===== OUTPUTS =====
    col1, col2 = st.columns(2)
    with col1:
        st.markdown(f"These colors contain the text `{match.simple_name}`:")
        st.plotly_chart(scalar_fig, use_container_width=True)
        st.markdown(f"Found {len(scalar_hits)} colors containing the string `{query}`.")
        with st.expander("Click to see the whole list"):
            st.markdown("- " + "\n- ".join(scalar_hits))
    with col2:
        st.markdown(f"These colors are close to the vector `({match.r}, {match.g}, {match.b})`:")
        st.plotly_chart(vector_fig, use_container_width=True)
        st.markdown(f"Found {len(vector_hits)} colors similar to `{query}` based on its `(R, G, B)` values.")
        with st.expander("Click to see the whole list"):
            st.markdown("- " + "\n- ".join(vector_hits))
    # ===== REFLECTIONS =====
    # Colors found by the vector search but missed by the name search
    unique_hits = [c for c in vector_hits if c not in scalar_hits]
    st.markdown("---")
    st.header("So what?")
    st.markdown("""
What did you notice?
The thing that stood out to me is how *robust* and *consistent*
the vector search results are.
It manages to find a bunch of related colors
regardless of what they're called. It doesn't matter that the color
'scarlet' does not contain the word 'red';
vector search goes ahead and finds all the neighboring colors based on a consistent criterion.
It easily found these colors, which a search on the name alone would have missed:
    """)
    with st.expander("See list:"):
        st.markdown("- " + "\n- ".join(unique_hits))
    st.markdown("""
I think it's brilliant - think about how much of a pain word searching is,
and how inconsistent it is. This has so many advantages!

---
    """)
    st.header("Generally speaking...")
    st.markdown("""
Obviously, this is a pretty simple, self-contained example.
Colors are particularly well suited to being represented by just a few
numbers, one per primary color. One number represents how much
`red` each color contains, another how much `green`, and the last how much `blue`.
But that core concept of representing similarity along different
properties using numbers is exactly what happens in other domains.
The only differences are in *how many* numbers are used, and what
they represent. For example, words or documents might be represented by
hundreds (e.g. 300 or 768) of AI-derived numbers.
We'll take a look at those examples as well later on.
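
The distance calculation itself does not change as the number of dimensions grows.
As a sketch, with random numbers standing in for real AI-derived vectors
(they are placeholders, not actual word or document embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Five made-up 300-dimensional vectors; real ones would come from a model.
vectors = rng.normal(size=(5, 300))
# A query that is a slightly nudged copy of vector 2.
query = vectors[2] + 0.01 * rng.normal(size=300)

# Same Euclidean distance as in the color example, just with 300 numbers.
dists = np.linalg.norm(vectors - query, axis=1)
nearest = int(np.argmin(dists))
print(nearest)
```
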
Techniques used to visualise those high-dimensional vectors are called
dimensionality reduction techniques. If you would like to see this in action, check out
[this app](https://huggingface.co/spaces/jphwang/reduce_dimensions).
    """)
    st.markdown("""
---
If you liked this - [follow me (@_jphwang) on Twitter](https://twitter.com/_jphwang)!
    """)


if __name__ == '__main__':
    main()