The idea behind word embeddings is a theory known as the distributional hypothesis. This hypothesis states that words that co-occur in the same contexts tend to have similar meanings. With word embeddings, we map words that appear in the same contexts to nearby places in our vector space (math-speak for the space in which our vectors exist).
The numeric values assigned to a word's vector representation are not important in their own right; they gain meaning from how similar or dissimilar the vectors of different words are to each other.
Thus the cosine distance between words with similar contexts will be small, and the cosine distance between words that have very different contexts will be large.
What these vectors do encode is latent information about how the words are used. We extract that information by comparing word vectors and seeing how similar or different they are.
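To make the idea of cosine distance concrete, here is a minimal sketch that computes it by hand. The three-dimensional vectors are made-up values chosen purely for illustration (real word embeddings have many more dimensions), and cosine_distance() is a hypothetical helper, not something provided by the lesson.

```python
import math

def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Made-up toy vectors (assumption: not real embeddings).
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.2]
car = [0.1, 0.2, 0.95]

dist_cat_kitten = cosine_distance(cat, kitten)
dist_cat_car = cosine_distance(cat, car)

# Vectors pointing in similar directions give a smaller distance.
print("cat vs. kitten:", dist_cat_kitten)
print("cat vs. car:", dist_cat_car)
```

With these toy values, the distance between cat and kitten comes out much smaller than the distance between cat and car, which is the behavior we would expect from embeddings of words used in similar versus different contexts.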
In script.py we have loaded a list of the 1,000 most common words in the English language, most_common_words, and their corresponding vector representations as word embeddings, vector_list.
Inspect these lists by printing the word at an index of most_common_words and the word embedding at the same index of vector_list to the terminal.
Also given in script.py is a function, find_closest_words(). This function accepts the following as arguments:
- a list of words
- their corresponding vector representations
- a target word
The function returns the words that have the smallest cosine distance between their vector representations and the target word.
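The lesson does not show how find_closest_words() is implemented, but one way such a function could work is sketched below. Everything here is an assumption made for illustration: the name find_closest_words_sketch(), the top_n parameter, the choice to exclude the target word itself, and the tiny made-up word list and vectors.

```python
import math

def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def find_closest_words_sketch(word_list, vector_list, word_to_check, top_n=2):
    """Hypothetical stand-in: return the top_n words nearest to word_to_check."""
    # Look up the target word's vector by its position in the word list.
    target_vector = vector_list[word_list.index(word_to_check)]
    # Pair every other word with its distance to the target.
    distances = [
        (cosine_distance(target_vector, vector), word)
        for word, vector in zip(word_list, vector_list)
        if word != word_to_check
    ]
    # Smallest distance first.
    distances.sort()
    return [word for _, word in distances[:top_n]]

# Made-up toy data (assumption: not the lesson's real embeddings).
toy_words = ["king", "queen", "prince", "banana"]
toy_vectors = [
    [0.9, 0.1, 0.2],
    [0.8, 0.2, 0.3],
    [0.7, 0.1, 0.4],
    [0.1, 0.9, 0.1],
]

closest_to_king = find_closest_words_sketch(toy_words, toy_vectors, "king")
print(closest_to_king)
```

The function in script.py may differ in how many words it returns and in whether the target word is included in the result.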
For example, we could find the most common words closest to “tree” like this:
closest_to_tree = find_closest_words(most_common_words, vector_list, "tree")
Call find_closest_words() with most_common_words, vector_list, and "food" as arguments and save the result to close_to_food. Print close_to_food to the terminal to see the result.
Which words does the function return? Are you surprised?
Call find_closest_words() again with most_common_words and vector_list as arguments, but this time change the last argument to "summer". Save the result to close_to_summer, and print close_to_summer to the terminal.
Which words does the function return? Any surprises this time around?
Feel free to experiment by calling find_closest_words() with a different target word and seeing what results you get!