Exercise 9: Exploring POS with the Brown corpus

Answer the following questions about the POS-tagged "mystery" portion of the Brown Corpus, part of NLTK's corpus collection. Start by loading the tagged words and the tagged sentences, as follows:
 
>>> import nltk
>>> br_tw = nltk.corpus.brown.tagged_words(categories='mystery')
>>> br_ts = nltk.corpus.brown.tagged_sents(categories='mystery')
>>> 
Note that this gives us the default Brown Corpus tagset. Rules: (1) Use only these two data objects (i.e., do not use nltk.corpus.brown.words(), etc.). (2) for loops are forbidden! Use list comprehensions instead. (3) When citing a POS, give its description ('possessive pronoun') along with its label ('PP$$').
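As a rough illustration of rule (2), the sketch below counts tags and tokens with list comprehensions instead of for loops. It runs on a small hand-made list, toy_tw, which is a hypothetical stand-in with the same (word, tag) shape as br_tw; its words and tags are made up for illustration and are not taken from the corpus. It uses collections.Counter so that it runs without any corpus download; nltk.FreqDist is built on Counter and offers the same most_common() method.

```python
from collections import Counter

# Hypothetical toy data in the same (word, tag) shape as br_tw.
toy_tw = [('The', 'AT'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'),
          ('the', 'AT'), ('old', 'JJ'), ('cat', 'NN'), ('and', 'CC'),
          ('the', 'AT'), ('bird', 'NN'), ('.', '.')]

# Extract just the tags with a list comprehension, then count them.
tags = [t for (w, t) in toy_tw]
tag_counts = Counter(tags)

# Most frequent tags first, as (tag, count) pairs.
top_tags = tag_counts.most_common(3)

# Case-insensitive token count: lowercase each word before comparing.
the_tokens = [w for (w, t) in toy_tw if w.lower() == 'the']
n_the = len(the_tokens)

# Number of distinct tags in the toy data.
n_tag_types = len(set(tags))
```

The same pattern should carry over to br_tw unchanged, since only the input list differs.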

  1. How many words and sentences are there?
  2. What are the top 10 POS tags and their counts?
  3. What are the top 10 words and their counts?
  4. How many different POS tags are represented in this Brown category? Why are there more than 87?
  5. What is the most frequent adverb?
  6. Consider the word 'so'. How many tokens are there? Note that you should include both 'So' and 'so'.
  7. 'So' is ambiguous: it can take on three distinct parts-of-speech. What are they, and which one is most frequent?
  8. For each of the three parts-of-speech for 'so', give an example sentence in which it is used as that POS.
  9. For each of the three possible parts-of-speech of 'so', find out:
    • the most likely POS preceding 'so' when it is used as that POS
    • the most likely POS following 'so' when it is used as that POS
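One possible way to approach question 9 without for loops is to pair each tagged token with its right-hand neighbor inside every sentence and then filter the pairs. The sketch below shows the idea on toy_ts, a hypothetical hand-made stand-in for br_ts, using the word 'that' rather than 'so'; the sentences and tags are invented for illustration. zip(sent, sent[1:]) yields the same adjacent pairs as nltk.bigrams(sent).

```python
from collections import Counter

# Hypothetical toy data in the same shape as br_ts:
# a list of sentences, each a list of (word, tag) pairs.
toy_ts = [
    [('I', 'PPSS'), ('knew', 'VBD'), ('that', 'CS'), ('he', 'PPS'),
     ('left', 'VBD'), ('.', '.')],
    [('She', 'PPS'), ('said', 'VBD'), ('that', 'CS'), ('it', 'PPS'),
     ('rained', 'VBD'), ('.', '.')],
    [('He', 'PPS'), ('saw', 'VBD'), ('that', 'DT'), ('.', '.')],
]

target = ('that', 'CS')  # the word (lowercased) and the POS of interest

# All adjacent token pairs, never crossing a sentence boundary.
pairs = [(a, b) for sent in toy_ts for (a, b) in zip(sent, sent[1:])]

# Tags of tokens immediately before the target word/tag combination.
prev_tags = Counter([a[1] for (a, b) in pairs
                     if (b[0].lower(), b[1]) == target])

# Tags of tokens immediately after the target word/tag combination.
next_tags = Counter([b[1] for (a, b) in pairs
                     if (a[0].lower(), a[1]) == target])

most_likely_prev = prev_tags.most_common(1)
most_likely_next = next_tags.most_common(1)
```

Building the pairs sentence by sentence keeps the last token of one sentence from being treated as the neighbor of the first token of the next.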


SUBMIT:
  • Your saved IDLE shell session, with errors and other long bits removed, edited to clearly mark the answer to each question (1, 2, 3, etc.). Feel free to insert any comments you might have.
  • If you prefer, you can submit an MS Word "answer sheet": a Word document with a screenshot of your IDLE shell session pasted in plus your written answers, like we did for previous assignments (example).