Word Embeddings
Distance

The idea at the heart of word embeddings is distance. Before we explain why, let’s dive into how the distance between vectors can be measured.

There are a variety of ways to find the distance between vectors, and here we will cover three. The first is called Manhattan distance.

In Manhattan distance, also known as city block distance, distance is defined as the sum of the absolute differences across each individual dimension of the vectors. Consider the vectors [1,2,3] and [2,4,6]. We can calculate the Manhattan distance between them as shown below:

$manhattan\ distance\ =\ \left | 1-2 \right |+\left | 2-4 \right| +\left | 3-6 \right|=1+2+3=6$

Another common distance metric is called the Euclidean distance, also known as straight line distance. With this distance metric, we take the square root of the sum of the squares of the differences in each dimension.

$euclidean\ distance\ =\sqrt{(1-2)^{2}+(2-4)^{2}+(3-6)^{2}}=\sqrt{14}\approx 3.74$
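Both formulas can be checked by hand with a few lines of NumPy. The snippet below is a minimal sketch: the function names `manhattan_distance` and `euclidean_distance` are our own, chosen only for illustration.

```python
import numpy as np

def manhattan_distance(a, b):
    # Sum of the absolute differences in each dimension
    return np.sum(np.abs(a - b))

def euclidean_distance(a, b):
    # Square root of the sum of the squared differences in each dimension
    return np.sqrt(np.sum((a - b) ** 2))

vector_a = np.array([1, 2, 3])
vector_b = np.array([2, 4, 6])

print(manhattan_distance(vector_a, vector_b))  # 6
print(euclidean_distance(vector_a, vector_b))  # approximately 3.74
```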

The final distance we will consider is the cosine distance. Cosine distance is concerned with the angle between two vectors, rather than the distance between the points, or ends, of the vectors. Two vectors that point in the same direction have no angle between them, and have a cosine distance of 0. Two vectors at a right angle to each other have a cosine distance of 1, and two vectors that point in opposite directions have a cosine distance of 2, the largest possible value. We would show you the calculation, but we don’t want to scare you away! For the mathematically adventurous, you can read up on the calculation here.
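For readers who would like a preview, cosine distance is commonly defined as one minus the cosine of the angle between the vectors:

$cosine\ distance\ =1-\frac{a\cdot b}{\left \| a \right \|\left \| b \right \|}$

The sketch below applies that definition with NumPy. The function name `cosine_distance` and the example vectors are our own, and the results may differ from the ideal values by a tiny floating-point rounding error.

```python
import numpy as np

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between a and b
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Same direction: approximately 0
same = cosine_distance(np.array([1, 2, 3]), np.array([2, 4, 6]))

# Right angle: 1
perpendicular = cosine_distance(np.array([1, 0]), np.array([0, 1]))

# Opposite directions: approximately 2
opposite = cosine_distance(np.array([1, 2, 3]), np.array([-1, -2, -3]))

print(same, perpendicular, opposite)
```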

We can easily calculate the Manhattan, Euclidean, and cosine distances between vectors using helper functions from SciPy:

import numpy as np
from scipy.spatial.distance import cityblock, euclidean, cosine

vector_a = np.array([1,2,3])
vector_b = np.array([2,4,6])

# Manhattan distance:
manhattan_d = cityblock(vector_a,vector_b) # 6

# Euclidean distance:
euclidean_d = euclidean(vector_a,vector_b) # 3.74

# Cosine distance:
cosine_d = cosine(vector_a,vector_b) # 0.0

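When working with vectors that have a large number of dimensions, such as word embeddings, the distances calculated by Manhattan and Euclidean distance can become rather large, because every additional dimension can add to the total. Cosine distance depends only on the angle between the vectors, so it always stays between 0 and 2. Thus, calculations using cosine distance are preferred!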
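To see how dimensionality affects each metric, here is a small sketch using toy vectors rather than real embeddings: one vector filled with 1s and one filled with 2s, first in 3 dimensions and then in 300. The vectors and the `results` dictionary are assumptions made up for this illustration.

```python
import numpy as np
from scipy.spatial.distance import cityblock, euclidean, cosine

results = {}
for dims in (3, 300):
    # Toy stand-ins for embeddings: every entry is 1 or 2
    vector_a = np.ones(dims)
    vector_b = 2 * np.ones(dims)
    results[dims] = (
        cityblock(vector_a, vector_b),  # grows in step with the number of dimensions
        euclidean(vector_a, vector_b),  # grows with the square root of the number of dimensions
        cosine(vector_a, vector_b),     # stays at 0: the vectors point the same way
    )
    print(dims, results[dims])
```

Although the two vectors relate to each other in the same way at both sizes, the Manhattan distance rises from 3 to 300 and the Euclidean distance from about 1.73 to about 17.32, while the cosine distance does not change.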

### Instructions

1.

Provided in script.py are the three vectors from the previous exercise, happy_vec, sad_vec, and angry_vec. Use SciPy to compute the Manhattan distance for the following:

• between happy_vec and sad_vec, storing the result in a variable man_happy_sad
• between sad_vec and angry_vec, storing the result in a variable man_sad_angry

Print man_happy_sad and man_sad_angry to the terminal.

Which word embeddings are a greater distance apart according to Manhattan distance?

2.

Now use SciPy to compute the Euclidean distance between happy_vec and sad_vec, storing the result in a variable euc_happy_sad, as well as the Euclidean distance between sad_vec and angry_vec, storing the result in a variable euc_sad_angry.

Print euc_happy_sad and euc_sad_angry to the terminal.

Which word embeddings are a greater distance apart according to Euclidean distance?

3.

Next stop, cosine city! Use SciPy to compute the cosine distance between happy_vec and sad_vec, storing the result in a variable cos_happy_sad, as well as the cosine distance between sad_vec and angry_vec, storing the result in a variable cos_sad_angry.

Print cos_happy_sad and cos_sad_angry to the terminal.

Which word embeddings are further apart according to cosine distance? What else do you notice about the different distance metrics? Are the values similar between the different techniques on each pairing of vectors?