Now that you know what a bag-of-words vector looks like, you can create a function that builds them!

First, we need a way of generating a features dictionary from a list of training documents. We can build a Python function to do that for us…



Define a function create_features_dictionary() that takes one argument, documents. This will be the list of string documents that we pass in (like ["All the cool fish love to fly high.", "Nobody knows why the fish fly so high.", "Those cool fish sure are spry."]).

Inside the function, set features_dictionary equal to an empty dictionary. This is where we’ll map all of our terms to index numbers. For now, return features_dictionary from the function.
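As a rough sketch, the function at this point might look like the following. It does nothing useful yet; it only sets up the empty dictionary that later steps will fill.

```python
def create_features_dictionary(documents):
    # Will eventually map each term to a vector index; empty for now.
    features_dictionary = {}
    return features_dictionary
```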


Above the return statement, join the documents into a single string separated by spaces and assign the result to merged.

Now that the documents are all in a single string, call preprocess_text() on merged and assign the result to tokens. Return tokens from the function in addition to features_dictionary.
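Here is one way the merge and tokenize steps could look. The lesson supplies its own preprocess_text(); the version below is only a hypothetical stand-in (lowercasing and splitting on non-letters) so the sketch runs by itself, and the real helper may behave differently, for example by lemmatizing. Returning features_dictionary first and tokens second is also an assumption about the order.

```python
import re


def preprocess_text(text):
    # Hypothetical stand-in for the lesson's helper: lowercase the text
    # and keep runs of letters (and apostrophes) as tokens.
    return re.findall(r"[a-z']+", text.lower())


def create_features_dictionary(documents):
    features_dictionary = {}
    # Join every document into one string, separated by spaces.
    merged = " ".join(documents)
    # Turn the merged string into a list of word tokens.
    tokens = preprocess_text(merged)
    return features_dictionary, tokens
```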


Above the return statement, assign index a value of 0. This will correspond to the first word’s vector index.


The words are prepared, the empty dictionary is prepared, and we have an index number we can use; it’s time to get the words into the dictionary and link each to a vector index number!

  • Above the return, loop through each token in tokens.
  • In the loop, check if token is NOT in features_dictionary.
  • If it’s a new word, add token as a key to features_dictionary with a value of index.

After adding token to features_dictionary, and still inside the if block, increment index by 1 so that each new word gets its own index and repeated words don't leave gaps in the numbering.
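Putting the steps above together, a minimal sketch of the finished function might look like this. As before, preprocess_text() here is a hypothetical stand-in for the helper the lesson provides, so the exact tokens (and therefore the exact indices) may differ from what you see in the exercise.

```python
import re


def preprocess_text(text):
    # Hypothetical stand-in for the lesson's helper: lowercase the text
    # and keep runs of letters (and apostrophes) as tokens.
    return re.findall(r"[a-z']+", text.lower())


def create_features_dictionary(documents):
    features_dictionary = {}
    merged = " ".join(documents)
    tokens = preprocess_text(merged)
    index = 0  # vector index for the first new word
    for token in tokens:
        # Only words we have not seen yet get a new index.
        if token not in features_dictionary:
            features_dictionary[token] = index
            index += 1
    return features_dictionary, tokens


training_documents = [
    "All the cool fish love to fly high.",
    "Nobody knows why the fish fly so high.",
    "Those cool fish sure are spry.",
]
features, tokens = create_features_dictionary(training_documents)
print(features)
```

With this stand-in tokenizer, the three example documents yield 16 distinct terms: "all" maps to index 0, "fish" to 3, and "spry" to 15. Repeated words such as "the" and "fish" keep the index they were given the first time they appeared.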


Uncomment the print statement to test out the function!
