The Machine Learning Process: Look What Taylor Swift Made Me Do

For the past few months, the Curriculum team at Codecademy has been hard at work creating Machine Learning courses. While we all loved writing courses, we also wanted to see what we could do with real-world data. As a result, we challenged each other to find a use for machine learning in a topic that we were passionate about. For me, that’s music.

It’s said that popular music is a reflection of society, a barometer for our collective wants, fears, and emotional states. Others believe that music is more a reflection of the artist, a diary that’s been flung from the nightstand drawer into the media frenzy of our modern world. In either case, music can offer insight into the human mind in ways many other mediums cannot.

One tool we can use to dig further into lyric-based music is natural language processing, or NLP, a subset of artificial intelligence devoted to the analysis of language. I wanted to use NLP to analyze the body of work of a popular artist with an intriguing history: Taylor Swift. You can also check out another machine learning project analyzing Survivor confessionals.

Forming Your Question and Gathering Data

For my analysis, I worked with a Kaggle dataset containing all of Taylor Swift’s lyrics, from her 2006 eponymous album to her most recent release, 2017’s Reputation. As someone only familiar with her bigger hits, I was interested to learn more about Taylor’s progression as an artist and a person. What are the core themes she addresses, and how have they changed as she grew up from a teenage country sweetheart into an international pop sensation? And what are the deeper connections in the word choices she makes in her songs?

Cleaning the Data

The original dataset contains 4,862 rows of data, one for each individual line of lyrics Taylor has sung, along with the track title and album for each line. Since I was interested in analyzing themes on a song-by-song basis, I had to aggregate the lyrics up to the song level using Pandas. Transforming your data to the right level of granularity for your analysis or machine learning project is a common task, and will often require some preprocessing and experimentation with Pandas to get just right.
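As a rough sketch of that aggregation step, the snippet below collapses a line-level table into one row per song. The column names (album, track_title, lyric) and the placeholder rows are assumptions for illustration, not the actual schema or contents of the Kaggle dataset.

```python
import pandas as pd

# Toy stand-in for the line-level dataset; the column names are assumptions.
lines = pd.DataFrame({
    "album": ["Album A", "Album A", "Album A", "Album B"],
    "track_title": ["Song 1", "Song 1", "Song 2", "Song 3"],
    "lyric": ["first line", "second line", "only line", "another line"],
})

# Collapse one-row-per-line into one-row-per-song by joining each song's lines.
songs = (
    lines.groupby(["album", "track_title"], sort=False)["lyric"]
    .agg(" ".join)
    .reset_index()
)

# One string per song, ready to be treated as a "document" later on.
documents = songs["lyric"].tolist()
```

Passing sort=False keeps the songs in their order of first appearance, which makes it easier to line the results back up with the track listing.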

One tricky aspect of NLP projects is that every text contains words that provide no meaningful information for detecting underlying structure or themes. These can be generally common words such as “I”, “me”, or “my”, as well as specific words that appear frequently across the entire collection of text being studied, known as the corpus. For Taylor’s songs, these can be words such as “oh” and “yeah”. In industry, these words are called stop words, and removing them from our corpus before analysis helps produce better results!
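One way a stop word list like this could be assembled is sketched below, assuming scikit-learn's built-in English list as the starting point. The custom additions and the helper function remove_stop_words are illustrative guesses, not the list actually used in this project.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Corpus-specific filler words; this particular set is an illustrative assumption.
custom_stop_words = {"oh", "yeah"}

# Combine the general English list with the corpus-specific additions.
stop_word_set = ENGLISH_STOP_WORDS.union(custom_stop_words)
list_of_stop_words = sorted(stop_word_set)

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in stop_word_set]

filtered = remove_stop_words("Oh yeah I love my guitar")
```

In practice the vectorizer can do this filtering itself when handed the list, so a helper like this is mostly useful for inspecting what survives.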

Making the Model

To understand the thematic changes in Taylor’s music over time, I decided to build a topic model based on her song lyrics. Topic Modeling is a process by which we find latent, or hidden, topics in a series of documents. What does this mean? It means that by looking at a series of documents, in this case, the songs in Taylor’s discography, we can find sets of words that often co-occur, forming cohesive ‘topics’ that are prevalent in certain songs from throughout her career. Once we define these topics and the words that compose them, we can then track how prevalent these topics are over time, indicating tonal shifts in Taylor’s music and thus reflecting back on her life.

The first step in building a topic model is to extract features from the corpus for the model to learn from. In NLP, a common technique for creating a feature space is the bag-of-words model. A bag-of-words model counts how often each word occurs in a document, with each unique word being its own feature and its count being the value. A means of feature extraction that digs a bit deeper is tf-idf, or term frequency-inverse document frequency. This method down-weights the counts of words that appear in many documents of the corpus, since a word that common is less likely to provide insight into the specific topics of any one document. Using tf-idf, each song gets features that represent how important each word is to that song.

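from sklearn.feature_extraction.text import TfidfVectorizer

# min_df=0.1 drops words that appear in fewer than 10% of the songs
tfidf_vectorizer = TfidfVectorizer(stop_words=list_of_stop_words, min_df=0.1)
tfidf_freq = tfidf_vectorizer.fit_transform(documents)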
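To make the difference between raw counts and tf-idf weights concrete, here is a small sketch on three made-up documents (the documents are placeholders, not lyrics from the dataset). The word "love" appears in every document, while "home" appears in only one.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three tiny placeholder documents; "love" occurs in all of them.
docs = ["love love dance", "love home", "love truck"]

# Bag-of-words: each cell is a raw count.
count_vec = CountVectorizer()
bow = count_vec.fit_transform(docs).toarray()

# tf-idf: counts are re-weighted so corpus-wide words matter less.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs).toarray()

# Compare the two words inside the second document ("love home").
bow_love = bow[1, count_vec.vocabulary_["love"]]
bow_home = bow[1, count_vec.vocabulary_["home"]]
tfidf_love = tfidf[1, tfidf_vec.vocabulary_["love"]]
tfidf_home = tfidf[1, tfidf_vec.vocabulary_["home"]]
```

In the bag-of-words matrix both words count once in the second document, but under tf-idf "home" ends up with the larger weight because it is rarer across the corpus.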

Now that I had my features, I could go create a topic model! The modeling technique I chose for this project is NMF. NMF, or non-negative matrix factorization, is an algorithm that we can use to pull out our topics, or co-occurring word groupings, and how prevalent these topics are across each song in Taylor’s discography.

The first part of the output from NMF is the set of words that make up each topic. At this point in the ML process, we get to use those creative juices! By checking the top 10 words in each topic, you can make an executive decision about what topic or idea the words represent.
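A minimal sketch of this step with scikit-learn's NMF is shown below. The toy documents, the choice of two topics, and the top-three cutoff are all assumptions made to keep the example small; the project itself used six topics and looked at the top 10 words.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder documents with two loosely separated vocabularies.
docs = [
    "dance party dance night",
    "party night dance",
    "dance dance party",
    "home town road",
    "road home town home",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Factor X into (documents x topics) and (topics x words), both non-negative.
nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
doc_topic = nmf.fit_transform(X)   # one row per document, one column per topic
topic_word = nmf.components_       # one row per topic, one column per word

# Recover the words in column order from the fitted vocabulary.
words = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# For each topic, list the words with the highest weights.
n_top = 3
top_words = [
    [words[i] for i in row.argsort()[::-1][:n_top]]
    for row in topic_word
]
```

Reading through top_words is where the human judgment comes in: the algorithm only returns weighted word lists, and naming each topic is up to you.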

Words per topic

Based on the words and my knowledge of Taylor Swift, I came up with the topics below.

topic_labels = ['Love and Beauty', 'Growing Up', 'Home', 'Bad/Remorse', 'Contemplative', 'Dancing']

The other piece of output from NMF is a document-topic matrix. In this matrix every row is a song, every column is a topic, and the value is a relative score of how much the topic exists in a specific song. I wanted to see how often each topic appears in each of Taylor’s songs, but I also wanted to set a threshold for how high a topic score needs to be in order for a song to be labeled with that topic. After playing around with the topic score threshold, I decided to set the threshold at 0.1.
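The thresholding idea can be sketched as follows. The scores, the three-topic subset, and the song names are made up for illustration, and treating a score equal to 0.1 as passing the threshold is my assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic scores: rows are songs, columns are topics.
labels = ["Love and Beauty", "Growing Up", "Home"]
scores = np.array([
    [0.25, 0.02, 0.00],
    [0.05, 0.12, 0.30],
    [0.00, 0.00, 0.09],
])

threshold = 0.1

# 1 if the song is labeled with the topic, 0 otherwise.
doc_tops = pd.DataFrame(
    (scores >= threshold).astype(int),
    columns=labels,
    index=["Song 1", "Song 2", "Song 3"],
)
```

Note that a song can carry several topic labels at once (Song 2 here) or none at all (Song 3), which is worth keeping in mind when reading the per-album counts.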

Now that I had this transformed matrix of topics and songs, I was able to focus my analysis on each album/year. After grouping by album and summing the count of songs across each topic, I had what I was looking for: the number of songs in each topic per album!

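# Select the topic columns with a list; recent pandas versions reject a bare tuple of column names here
album_topics = doc_tops.groupby('album')[['Love and Beauty', 'Growing Up', 'Home', 'Bad/Remorse', 'Contemplative', 'Dancing']].sum().reset_index()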

Presenting the Results

From the count of songs per topic on each album, I was able to construct the topics-over-time graph below.

Topics over time

When it came to interpreting and validating my results, I referred to Codecademy’s in-house Taylor Swift expert, my colleague Laura. With Laura’s deep knowledge of both Taylor’s catalog and life, we were able to match the changes in topic density over time seen above with the events of Taylor’s life.

As Taylor progresses from a country artist into a pop artist, we see an increase in the content of songs related to dancing. When Taylor moves to New York from Nashville in 2010, we see the topic of growing up at its most prevalent. Throughout her career, we see an interesting fight between the topics of Love and Beauty versus Bad/Remorse. With each new romance and subsequent heartbreak Taylor experiences, these topics continue to be common and at war in terms of dominating her music. 2012 brought arguably Taylor’s best and most popular album, Red, and with it the greatest number of songs on an album with the topic of Love and Beauty. Other interesting highlights include a spike in the negative topic of Bad/Remorse during 2014, the time when Taylor was infamously at odds with Kanye West. In 2017, Taylor seems to be less contemplative than before, indicating a greater sense of self and confidence.

For all its greatness, topic modeling is not perfect. According to Laura, during Taylor’s Reputation era, she was experiencing the greatest level of love, beauty, and acceptance in her personal life. This doesn’t seem to match up with what our topic model says, but perhaps some finer tuning is needed.

Making a New Model

Now that I had my complete topic model, I wanted to create a second model that looked at the deeper relationship between individual words rather than the overarching topics of Taylor’s songs. To do this I used a modeling technique called word2vec. With this model, we can map each word that appears in Taylor’s lyrics to a 100-dimensional vector space, where semantically similar words are mapped to nearby points. We can then look at the similarity of words by comparing the distance between their mapped points. This mapping of a word to a vector space is called a word embedding. With this kind of model, we are able to see how similar certain words are to each other with respect to Taylor’s songwriting style.
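To illustrate what "comparing the distance between mapped points" can look like, here is a small sketch using cosine similarity, a common choice for word vectors. The three 100-dimensional vectors are randomly generated stand-ins, not embeddings trained on the lyrics, and the word names are placeholders.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
base = rng.normal(size=100)

# Made-up embeddings: word_b is a slightly perturbed copy of word_a,
# while word_c is drawn independently.
embeddings = {
    "word_a": base,
    "word_b": base + 0.1 * rng.normal(size=100),
    "word_c": rng.normal(size=100),
}

sim_ab = cosine_similarity(embeddings["word_a"], embeddings["word_b"])
sim_ac = cosine_similarity(embeddings["word_a"], embeddings["word_c"])
```

Here sim_ab comes out much higher than sim_ac, which is the pattern a trained model would show for two words used in similar contexts versus two unrelated ones.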

Given these word embeddings, I wanted to find a way to visualize which words are related and which do not show a connection. Enter everyone’s favorite high-dimensional visualization tool, t-SNE! t-SNE, or t-Distributed Stochastic Neighbor Embedding, is a dimensionality reduction technique designed for visualizing higher-dimensional data in a 2-D space. What does a 100-dimensional word vector look like? It’s hard to tell! I could spend hours researching the intricacies of higher-dimensional spaces, but finding a way to represent this data in 2-D is much more useful for a simple-minded human like me looking for insights. By putting the Taylor-specific word embeddings into t-SNE, we can explore Taylor’s word choices!
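A minimal sketch of the reduction step with scikit-learn's TSNE follows. The input is random 100-dimensional data standing in for real word vectors, and the perplexity and other settings are assumptions picked to suit a tiny example rather than the values used for the plots in this post.

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-ins for 30 word vectors of 100 dimensions each.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(30, 100))

# Perplexity must be smaller than the number of points being embedded.
tsne = TSNE(n_components=2, perplexity=5, init="random", random_state=0)
coords = tsne.fit_transform(vectors)

# Each row of coords is now an (x, y) position that can be scatter-plotted
# and annotated with the word it belongs to.
```

Keep in mind that t-SNE mainly preserves local neighborhoods, so which words sit close together is more meaningful than the distances between far-apart clusters.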

Presenting the Results

With the t-SNE plot below, we can get an idea of how closely related words are, in terms of Taylor’s word choices, by observing the words that cluster together. Remember, the words are clustered based on their embeddings’ similarities.

t-SNE plot of word embeddings

Let’s zoom in on a cluster from the bottom right to see what is going on.

Zoomed in t-SNE plot of word embeddings

Now you might not claim to be a fan of Taylor Swift, but most would be sure to recognize this cluster of words as coming from Taylor’s song “We Are Never Ever Getting Back Together”. C’mon, you know this one!

Other interesting clusterings include on the bottom left “sing” and “loving”, suggesting Taylor’s affinity for the talent that has brought her fame, as well as “bad” and “blood” bottom center, alluding to her and Kendrick Lamar’s song “Bad Blood”.

Besides clusterings, we can presume that words on their own are used in unique ways compared to the rest of Taylor’s diction. Both “sad” and “heart”, more isolated in the t-SNE than most terms, popped out to me as provoking words that Taylor seems to use in her songwriting uniquely and with great intent.

Further Work

The analysis done here is just the start of all the cool things I could do with this data set. Given my topic model, I could create a recommendation engine to help listeners discover new Taylor Swift songs based on their favorites. I could also dig deeper into the syntax of Taylor’s lyrics, perform a grammatical analysis, and then create a song generator to make my own Taylor Swift lyrics! The fruits of NLP are endless, and the insights it can provide give us a deeper understanding of who we are as a communicative species. What will you find out with NLP?