Amazing work! As is the case with many tasks in Python, there’s already a library that can do all of that work for you.

For text_to_bow(), you can approximate the functionality with the Counter class from the collections module:

    from collections import Counter

    tokens = ['another', 'five', 'fish', 'find', 'another', 'faraway', 'fish']
    print(Counter(tokens))
    # Counter({'fish': 2, 'another': 2, 'find': 1, 'five': 1, 'faraway': 1})
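As a rough sketch of how Counter could stand in for text_to_bow() on raw text: the helper name simple_text_to_bow and its preprocessing (lowercasing, then keeping runs of word characters with a regular expression) are assumptions made for illustration, not the lesson's own implementation.

```python
import re
from collections import Counter


def simple_text_to_bow(text):
    """Hypothetical helper: count lowercase word tokens in a string."""
    # Assumed preprocessing: lowercase, then keep runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


bow = simple_text_to_bow("Another five fish find another faraway fish.")
print(bow["fish"])   # 2
print(bow["shark"])  # 0 (a Counter returns 0 for keys it has not seen)
```

Because a Counter returns 0 for missing keys, looking up a word that never appeared does not raise a KeyError, which is convenient for bag-of-words lookups.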

For vectorization, you can use CountVectorizer from the machine learning library scikit-learn. You can use fit() to train the features dictionary and then transform() to transform text into a vector:

    from sklearn.feature_extraction.text import CountVectorizer

    training_documents = [
        "Five fantastic fish flew off to find faraway functions.",
        "Maybe find another five fantastic fish?",
        "Find my fish with a function please!",
    ]
    test_text = ["Another five fish find another faraway fish."]

    bow_vectorizer = CountVectorizer()
    bow_vectorizer.fit(training_documents)
    bow_vector = bow_vectorizer.transform(test_text)
    print(bow_vector.toarray())
    # [[2 0 1 1 2 1 0 0 0 0 0 0 0 0 0]]
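To make sense of the positions in that vector, it can help to look at the dictionary the vectorizer learned. The sketch below reuses the same training documents and reads the fitted vocabulary_ attribute. The variable names features_in_order and unseen_vector, and the "Sharks find fish" sentence, are made up for illustration. With the default settings, CountVectorizer lowercases text and ignores single-character tokens, so "a" does not become a feature.

```python
from sklearn.feature_extraction.text import CountVectorizer

training_documents = [
    "Five fantastic fish flew off to find faraway functions.",
    "Maybe find another five fantastic fish?",
    "Find my fish with a function please!",
]

bow_vectorizer = CountVectorizer()
bow_vectorizer.fit(training_documents)

# vocabulary_ maps each learned word to its column index in the vector.
vocabulary = bow_vectorizer.vocabulary_
features_in_order = sorted(vocabulary, key=vocabulary.get)
print(features_in_order[:6])
# ['another', 'fantastic', 'faraway', 'find', 'fish', 'five']

# Words never seen during fit() are ignored by transform().
unseen_vector = bow_vectorizer.transform(["Sharks find fish"])
print(unseen_vector.toarray().sum())  # 2
```

The first six features line up with the first six counts in the vector above: "another" appears twice, "fantastic" zero times, and so on.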



Now, let’s see how scikit-learn handles the same bag-of-words functionality! Import CountVectorizer from sklearn.feature_extraction.text. (Check the example above for the import statement.)


Define bow_vectorizer as our vectorizer using CountVectorizer().


Define training_vectors as bow_vectorizer.fit_transform() called on training_docs.

fit_transform() does two things in one call: it creates the features dictionary and it vectorizes the training data.

Define test_vectors as bow_vectorizer.transform() called on test_docs.


Uncomment the code at the bottom of script.py. Run the code again to see why it makes sense to use sklearn’s optimized functions!
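The steps above can be sketched roughly as follows. The contents of training_docs and test_docs here are hypothetical stand-ins, since the exercise's actual documents live in script.py and are not shown on this page.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the exercise's training_docs and test_docs.
training_docs = [
    "Five fantastic fish flew off to find faraway functions.",
    "Maybe find another five fantastic fish?",
]
test_docs = ["Another five fish find another faraway fish."]

bow_vectorizer = CountVectorizer()

# fit_transform() builds the features dictionary and vectorizes in one call.
training_vectors = bow_vectorizer.fit_transform(training_docs)

# transform() reuses the dictionary learned above; it does not refit.
test_vectors = bow_vectorizer.transform(test_docs)

print(training_vectors.shape)  # (2, 11)
print(test_vectors.shape)      # (1, 11)
```

Both results have the same number of columns because the test data is vectorized against the dictionary built from the training data. Calling fit_transform() on the test documents instead would build a different dictionary, and the columns would no longer line up.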
