LING 1330/2330 Introduction to Computational Linguistics, University of Pittsburgh

Exercise 9: Exploring POS with the Brown corpus

Answer the following questions about the POS-tagged "mystery" portion of the Brown Corpus, part of NLTK's corpus collection. Start by loading the tagged words and the tagged sentences, as follows:

>>> import nltk
>>> br_tw = nltk.corpus.brown.tagged_words(categories='mystery')
>>> br_ts = nltk.corpus.brown.tagged_sents(categories='mystery')
>>>

Note that this gives us the default Brown Corpus tagset. Rules: (1) Use only these two data objects (i.e., do not use nltk.corpus.brown.words(), etc.). (2) For loops are forbidden! Use list comprehensions instead. (3) When citing a POS, give its description ('possessive pronoun') along with its label ('PP$$').

How many words and sentences are there?
What are the top 10 POS tags and their counts?
Extract a list of POS tags from br_tw using list comprehension, and build a FreqDist object from it. Then, call .most_common(10) on it. See the example right below Table 2.1 in the book for illustration. brown_news_tagged in the example is our br_tw.
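The hint above can be sketched on toy data. This is only an illustration, not the answer: toy_tw is a made-up stand-in for br_tw, and collections.Counter stands in for nltk.FreqDist (which is a Counter subclass, so .most_common() should behave the same way on the real object).

```python
from collections import Counter

# Made-up stand-in for br_tw: a list of (word, tag) tuples.
toy_tw = [('the', 'AT'), ('man', 'NN'), ('saw', 'VBD'), ('the', 'AT'),
          ('dog', 'NN'), ('and', 'CC'), ('the', 'AT'), ('cats', 'NNS'),
          ('.', '.')]

# List comprehension: keep only the tag from each (word, tag) pair.
toy_tags = [tag for (word, tag) in toy_tw]

# Counter plays the role of nltk.FreqDist here (assumption: stdlib only).
tag_fd = Counter(toy_tags)

# The two most frequent tags with their counts.
top2 = tag_fd.most_common(2)
print(top2)
```

With the real data, the same pattern would use br_tw in place of toy_tw and nltk.FreqDist in place of Counter.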
What are the top 10 words and their counts?
Do exactly what you did for the previous question but adapt it for words instead of tags.
How many different POS tags are represented in this Brown category? Why are there more than 87?
Use len() and .keys() on the POS tag FreqDist object you built above. Alternatively, you can use set().
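Both counting routes can be compared on a small made-up tag list. As a sketch under assumptions: the tags below are invented for illustration, and Counter again stands in for nltk.FreqDist. Note that tag strings which differ in any way, such as 'NN' and 'NN-TL', count as separate types.

```python
from collections import Counter

# Made-up list of tags; compound labels are distinct strings from 'NN'.
toy_tags = ['AT', 'NN', 'VBD', 'AT', 'NN-TL', 'NN', 'FW-NN']

tag_fd = Counter(toy_tags)  # stand-in for nltk.FreqDist

# Route 1: number of keys in the frequency distribution.
n_via_keys = len(tag_fd.keys())

# Route 2: number of unique items in the raw tag list.
n_via_set = len(set(toy_tags))

print(n_via_keys, n_via_set)
```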
What is the most frequent adverb?
Option 1: use list comprehension to build a list of words with the 'RB' tag, and then use nltk.FreqDist.
Option 2: build an nltk.ConditionalFreqDist of all Brown tagged words, with TAGS, not words, as conditions. You will need to build a list of (TAG, WORD) tuples first.
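The two options can be sketched side by side on toy data. Assumptions: toy_tw is invented, words are lowercased before counting (a choice, not a requirement of the question), Counter stands in for nltk.FreqDist, and a plain dict of Counters stands in for nltk.ConditionalFreqDist, which accepts the (TAG, WORD) list directly.

```python
from collections import Counter

# Made-up stand-in for br_tw.
toy_tw = [('He', 'PPS'), ('ran', 'VBD'), ('quickly', 'RB'), ('and', 'CC'),
          ('then', 'RB'), ('quietly', 'RB'), ('left', 'VBD'), ('.', '.'),
          ('Then', 'RB'), ('she', 'PPS'), ('left', 'VBD'), ('.', '.')]

# Option 1: filter by tag first, then count the remaining words.
rb_words = [word.lower() for (word, tag) in toy_tw if tag == 'RB']
rb_fd = Counter(rb_words)

# Option 2: flip each pair to (TAG, WORD), then group counts by tag.
tag_word_pairs = [(tag, word.lower()) for (word, tag) in toy_tw]
conditions = set(tag for (tag, word) in tag_word_pairs)
cfd = {cond: Counter(word for (tag, word) in tag_word_pairs if tag == cond)
       for cond in conditions}

print(rb_fd.most_common(1))
print(cfd['RB'].most_common(1))
```

Option 2 costs more to build, but once it exists the same object answers the question for any tag, not just 'RB'.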
Consider the word 'so'. How many tokens are there? Note that you should include both 'So' and 'so'.
'So' is ambiguous: it can take on three distinct parts-of-speech. What are they, and which one is most frequent?
Use list comprehension, and then build a FreqDist. OR, build a ConditionalFreqDist object with words as conditions.
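To avoid giving the answer away, here is the same idea applied to a different ambiguous word, 'that', on invented data. The tags assigned below are for illustration only, and Counter stands in for nltk.FreqDist.

```python
from collections import Counter

# Made-up stand-in for br_tw; tag assignments are illustrative.
toy_tw = [('That', 'DT'), ('is', 'BEZ'), ('the', 'AT'), ('man', 'NN'),
          ('that', 'WPS'), ('said', 'VBD'), ('that', 'CS'), ('he', 'PPS'),
          ('knew', 'VBD'), ('that', 'DT'), ('.', '.')]

# Lowercase before comparing so 'That' and 'that' are both caught.
that_tags = [tag for (word, tag) in toy_tw if word.lower() == 'that']

# Token count for the word, regardless of capitalization.
n_tokens = len(that_tags)

# Distribution of tags for this one word.
that_fd = Counter(that_tags)

print(n_tokens)
print(that_fd.most_common())
```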
For each of the three parts-of-speech for 'so', give an example sentence where it is used with that POS.
This time, operate on br_ts, the list of tagged sentences.
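A sentence-level search might look like the sketch below. It uses invented sentences and a different target, ('that', 'CS'), as a stand-in; toy_ts mimics the shape of br_ts, a list of sentences where each sentence is a list of (word, tag) tuples.

```python
# Made-up stand-in for br_ts.
toy_ts = [
    [('That', 'DT'), ('is', 'BEZ'), ('fine', 'JJ'), ('.', '.')],
    [('He', 'PPS'), ('said', 'VBD'), ('that', 'CS'), ('she', 'PPS'),
     ('left', 'VBD'), ('.', '.')],
    [('She', 'PPS'), ('left', 'VBD'), ('.', '.')],
]

target = ('that', 'CS')  # (lowercased word, tag) pair to look for

# Keep every sentence that contains the target pair.
hits = [sent for sent in toy_ts
        if target in [(w.lower(), t) for (w, t) in sent]]

# Strip the tags and join the words to get a readable example sentence.
example = ' '.join(w for (w, t) in hits[0])
print(example)
```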
For each of the three possible parts-of-speech of 'so', find out:
- the most likely POS preceding 'so' when it has that POS
- the most likely POS following 'so' when it has that POS
This one is more involved. There are many different ways to achieve this, some simpler than others. One way is to use a for loop over the indices of br_tw (I'll allow a for loop this time). A better solution involves constructing bigrams or trigrams of tagged words and processing them. In either approach, FreqDist and/or ConditionalFreqDist come in handy. The book has a similar example. Go ahead and give it a try; we will work on this together in class.
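The bigram approach can be sketched as follows, again on invented data with ('that', 'CS') as a stand-in target. Assumptions: zip(toy_tw, toy_tw[1:]) stands in for nltk.bigrams, and Counter stands in for nltk.FreqDist. One caveat of this sketch is that bigrams built from the flat word list also pair the last token of one sentence with the first token of the next.

```python
from collections import Counter

# Made-up stand-in for br_tw.
toy_tw = [('He', 'PPS'), ('said', 'VBD'), ('that', 'CS'), ('she', 'PPS'),
          ('knew', 'VBD'), ('that', 'CS'), ('he', 'PPS'), ('left', 'VBD'),
          ('.', '.'), ('I', 'PPSS'), ('know', 'VB'), ('that', 'CS'),
          ('it', 'PPS'), ('rained', 'VBD'), ('.', '.')]

# Adjacent pairs of tagged words: ((w1, t1), (w2, t2)).
bigrams = list(zip(toy_tw, toy_tw[1:]))

target = ('that', 'CS')

# Tags of the word right before the target (target is second in the pair).
prev_tags = Counter(t1 for ((w1, t1), (w2, t2)) in bigrams
                    if (w2.lower(), t2) == target)

# Tags of the word right after the target (target is first in the pair).
next_tags = Counter(t2 for ((w1, t1), (w2, t2)) in bigrams
                    if (w1.lower(), t1) == target)

print(prev_tags.most_common(1))
print(next_tags.most_common(1))
```

Trigrams would let both neighbors be read off a single tuple, at the cost of missing targets that sit at the very start or end of the list.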

SUBMIT:

Your saved IDLE shell session, with errors and other long bits removed, edited to clearly mark the answers to questions PART1, a. b. c. d. etc. Feel free to insert any comments you might have.
If you prefer, you can submit an MS Word Doc "answer sheet": a Word document with your screenshot of IDLE shell pasted in plus your written answers, like we did for previous assignments (example).