Term Frequency–Inverse Document Frequency
Converting Bag-of-Words to Tf-idf

In addition to directly calculating the tf-idf scores for a set of terms across a corpus, you can also convert a bag-of-words model you have already created into tf-idf scores.

Scikit-learn’s TfidfTransformer converts a bag-of-words model into tf-idf scores. You begin by initializing a TfidfTransformer object:

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer(norm=None)

Given a bag-of-words matrix count_matrix, you can now multiply the term frequencies by their inverse document frequency to get the tf-idf scores as follows:

tf_idf_scores = tfidf_transformer.fit_transform(count_matrix)

This is very similar to how we calculated inverse document frequency, except that this time we call .fit_transform(), which both fits the TfidfTransformer to the term frequencies (the bag-of-words vectors) and transforms them into tf-idf scores, rather than only fitting it with .fit().
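To make the conversion concrete, here is a minimal, self-contained sketch. The three-line corpus and the variable names vectorizer and manual_scores are assumptions made up for illustration and are not part of the lesson files. CountVectorizer builds the bag-of-words matrix, and the final check confirms that, with norm=None, the transformer's output equals each term count multiplied by the idf weight the transformer learned (exposed as idf_).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical three-document corpus, used only for illustration.
corpus = [
    "the bee is not afraid of me",
    "the bee and the butterfly",
    "hope is the thing with feathers",
]

# Build the bag-of-words matrix. Scikit-learn puts one document per row
# and one term per column.
vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(corpus)

# norm=None keeps the raw tf * idf products instead of normalizing each row.
tfidf_transformer = TfidfTransformer(norm=None)
tfidf_scores = tfidf_transformer.fit_transform(count_matrix)

# The same result by hand: scale each term's column by its learned idf weight.
manual_scores = count_matrix.toarray() * tfidf_transformer.idf_

# Check that the transformer output matches the manual multiplication.
print(np.allclose(tfidf_scores.toarray(), manual_scores))
```

With scikit-learn's default settings (smooth_idf=True), the idf weight for a term is ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. A term that appears in every document therefore gets an idf of exactly 1, so its tf-idf score equals its raw count.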

Instructions

1.

Consider one last time the same selection of six Emily Dickinson poems given in poems.py. The term frequency of each term-document pair is calculated in term_frequency.py and stored in bow_matrix as a matrix and in df_bag_of_words as a pandas DataFrame.

In script.py, print df_bag_of_words to view the bag-of-words matrix (term-document matrix of term frequencies).

2.

Initialize a TfidfTransformer object named transformer with keyword argument norm=None.

3.

Use transformer to fit and transform the bag-of-words matrix bow_matrix into tf-idf scores. Save your result to tfidf_scores.
