Natural Language Parsing with Regular Expressions
Chunking Noun Phrases

While you are able to chunk any sequence of parts of speech that you like, there are certain types of chunking that are linguistically helpful for determining meaning and bias in a piece of text. One such type of chunking is NP-chunking, or noun phrase chunking. A noun phrase is a phrase that contains a noun and operates, as a unit, as a noun.

A popular form of noun phrase begins with an optional determiner DT, which specifies the noun being referenced, followed by any number of adjectives JJ, which describe the noun, and ends with a noun NN.

Consider the part-of-speech tagged sentence below:

[('we', 'PRP'), ('are', 'VBP'), ('so', 'RB'), ('grateful', 'JJ'), ('to', 'TO'), ('you', 'PRP'), ('for', 'IN'), ('having', 'VBG'), ('killed', 'VBN'), ('the', 'DT'), ('wicked', 'JJ'), ('witch', 'NN'), ('of', 'IN'), ('the', 'DT'), ('east', 'NN'), (',', ','), ('and', 'CC'), ('for', 'IN'), ('setting', 'VBG'), ('our', 'PRP$'), ('people', 'NNS'), ('free', 'VBP'), ('from', 'IN'), ('bondage', 'NN'), ('.', '.')]

Can you spot the three noun phrases of the form described above? They are:

  • (('the', 'DT'), ('wicked', 'JJ'), ('witch', 'NN'))
  • (('the', 'DT'), ('east', 'NN'))
  • (('bondage', 'NN'))

With the help of a chunk grammar defined by a regular expression, you can easily find all the non-overlapping noun phrases in a piece of text! Just like in normal regular expressions, you can use quantifiers to indicate how many of each part of speech you want to match.

The chunk grammar for a noun phrase can be written as follows:

chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"
  • NP is the user-defined name of the chunk you are searching for. In this case NP stands for noun phrase
  • <DT> matches any determiner
  • ? is an optional quantifier, matching either 0 or 1 determiners
  • <JJ> matches any adjective
  • * is the Kleene star quantifier, matching 0 or more occurrences of an adjective
  • <NN> matches any singular noun (the tag NN only); to also match plural and proper nouns such as NNS or NNP, use <NN.*>
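
To make the grammar concrete, here is a minimal sketch that applies it to the tagged sentence from earlier in this lesson. It assumes NLTK is installed; the variable names tagged_sentence and np_chunks are chosen for illustration only.

```python
from nltk import RegexpParser

# The part-of-speech tagged sentence shown earlier in this lesson.
tagged_sentence = [
    ('we', 'PRP'), ('are', 'VBP'), ('so', 'RB'), ('grateful', 'JJ'),
    ('to', 'TO'), ('you', 'PRP'), ('for', 'IN'), ('having', 'VBG'),
    ('killed', 'VBN'), ('the', 'DT'), ('wicked', 'JJ'), ('witch', 'NN'),
    ('of', 'IN'), ('the', 'DT'), ('east', 'NN'), (',', ','), ('and', 'CC'),
    ('for', 'IN'), ('setting', 'VBG'), ('our', 'PRP$'), ('people', 'NNS'),
    ('free', 'VBP'), ('from', 'IN'), ('bondage', 'NN'), ('.', '.'),
]

# Optional determiner, any number of adjectives, then a noun tagged NN.
chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = RegexpParser(chunk_grammar)

# .parse() returns a tree; each NP-chunk is a subtree labeled "NP".
chunked_sentence = chunk_parser.parse(tagged_sentence)
np_chunks = [
    subtree.leaves()
    for subtree in chunked_sentence.subtrees(filter=lambda t: t.label() == "NP")
]

for chunk in np_chunks:
    print(chunk)
```

This should print the same three noun phrases listed above. Note that ('people', 'NNS') is not chunked, because its tag is NNS rather than NN.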

By finding all the NP-chunks in a text, you can perform a frequency analysis and identify important, recurring noun phrases. You can also use these NP-chunks as pseudo-topics and tag articles and documents by their highest count NP-chunks! Or perhaps your analysis has you looking at the adjective choices an author makes for different nouns.
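
As a rough illustration of both ideas, the sketch below counts chunk frequencies and groups adjectives by the noun they describe. The chunks in np_chunks are made-up sample data in the same shape as the examples above, not output from the novel, and the name adjectives_by_noun is hypothetical.

```python
from collections import Counter, defaultdict

# Made-up NP-chunks for illustration; each chunk is a tuple of (word, tag) pairs.
np_chunks = [
    (("the", "DT"), ("wicked", "JJ"), ("witch", "NN")),
    (("the", "DT"), ("east", "NN")),
    (("bondage", "NN"),),
    (("the", "DT"), ("wicked", "JJ"), ("witch", "NN")),
    (("the", "DT"), ("good", "JJ"), ("witch", "NN")),
]

# Frequency analysis: count how often each distinct chunk occurs.
chunk_counts = Counter(np_chunks)
top_chunk, top_count = chunk_counts.most_common(1)[0]

# Adjective choices: under the DT? JJ* NN pattern the last word is the noun,
# so collect every JJ-tagged word that precedes it.
adjectives_by_noun = defaultdict(Counter)
for chunk in np_chunks:
    noun = chunk[-1][0]
    for word, tag in chunk[:-1]:
        if tag == "JJ":
            adjectives_by_noun[noun][word] += 1

print(top_chunk, top_count)
print(dict(adjectives_by_noun["witch"]))
```

Because each chunk is stored as a tuple, it is hashable and can be used directly as a Counter key.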

It is ultimately up to you, with your knowledge of the text you are working with, to interpret the meaning and use-case of the NP-chunks and their frequency of occurrence.

Instructions

1.

Define a piece of chunk grammar named chunk_grammar that will chunk a noun phrase. Name the chunk NP.

2.

Create a RegexpParser object called chunk_parser using chunk_grammar as an argument.

3.

The part-of-speech tagged novel pos_tagged_oz that you previously created has been imported for you in the workspace.

Create a for loop that iterates over each part-of-speech tagged sentence in pos_tagged_oz. Within the for loop, NP-chunk each part-of-speech tagged sentence using chunk_parser's .parse() method and append the result to np_chunked_oz. Each item in np_chunked_oz will now be a noun phrase chunked sentence from The Wonderful Wizard of Oz!

4.

A customized function np_chunk_counter that returns the 30 most common NP-chunks from a list of chunked sentences has been imported to the workspace for you. Call np_chunk_counter with np_chunked_oz as an argument and save the result to a variable named most_common_np_chunks.

Print most_common_np_chunks. What sticks out to you about the most common noun phrase chunks? Are you surprised by anything? Open the hint to see our analysis.

Want to see how np_chunk_counter works? Use the file navigator to open np_chunk_counter.py and inspect the function.
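
The overall shape of these steps can be sketched on a tiny stand-in corpus. This assumes NLTK is installed. The names pos_tagged_sample and np_chunked_sample stand in for pos_tagged_oz and np_chunked_oz, and count_np_chunks is a hypothetical counter written for this example; the np_chunk_counter provided in the workspace may work differently.

```python
from collections import Counter

from nltk import RegexpParser

# Stand-in for pos_tagged_oz: a list of part-of-speech tagged sentences.
pos_tagged_sample = [
    [("the", "DT"), ("wicked", "JJ"), ("witch", "NN"),
     ("laughed", "VBD"), (".", ".")],
    [("dorothy", "NN"), ("feared", "VBD"), ("the", "DT"),
     ("wicked", "JJ"), ("witch", "NN"), (".", ".")],
]

chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = RegexpParser(chunk_grammar)

# NP-chunk every sentence and collect the resulting trees.
np_chunked_sample = []
for tagged_sentence in pos_tagged_sample:
    np_chunked_sample.append(chunk_parser.parse(tagged_sentence))


def count_np_chunks(chunked_sentences, top_n=30):
    """Hypothetical counter: return the top_n most common NP-chunks."""
    counts = Counter()
    for tree in chunked_sentences:
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
            # Convert the list of (word, tag) leaves to a hashable tuple.
            counts[tuple(subtree.leaves())] += 1
    return counts.most_common(top_n)


most_common_np_chunks = count_np_chunks(np_chunked_sample)
print(most_common_np_chunks)
```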
