LING 1330/2330 Introduction to Computational Linguistics, University of Pittsburgh

Go to: LING 1330/2330 home page

Homework Assignment 8

Before You Begin

Work on a Python script for this homework. But as always, your best bet is to switch back and forth between your Python script and shell: execute your script, try out follow-up commands in the shell, update your script with successful code, and repeat.
The questions in this assignment should be answered directly by your script: have the script print out the answers and/or include them as comments.
Document your script by inserting appropriate comments. The script is going to get long, so it should help keep your code organized and also facilitate my grading.
You might not like the particular flavor of phrase structure grammar I used here. Well trust me, I can/would love to give you a much bigger, more detailed, and overall better grammar! But do stick to this version for the purpose of this assignment: your rules and trees should exactly match the examples given.
In the trees and rules, use lower-case ('the', 'he') for all words, even at the beginning of a sentence. The only exceptions are the proper names ('Homer', 'Marge', etc.). This simplifies grammar development and parsing.
For the same reason, disregard punctuation and symbols for this assignment.
For your reference, all the sentences and their tree drawings used in this assignment can be found on this page. Make sure the trees you build matches the tree representation on it.

PART A: Build a Context-Free Grammar (CFG) and a Parser [20 points]
Your job for this part is to develop a context-free grammar (CFG) and a chart parser (as shown in 8.1.2 Ubiquitous ambiguity) using the grammar.

Using NLTK's nltk.CFG.fromstring() method, build a CFG named grammar1. The grammar should cover all of the sentences below and their tree structure as presented on this page. The grammar's start symbol should be 'S': make sure that an S rule (ex. S -> NP VP) is the very top rule in your list of rules.
(s6) the big bully punched the tiny nerdy kid after school
(s7) he gave the book to his sister
(s8) he gave the book that I had given him t to his sister
*NOTE: t denotes a trace. For this homework, you may treat it like an ordinary lexical item.
(s9) Homer and Marge are poor but very happy
(s10) Homer and his friends from work drank and sang in the bar
(s11) Lisa told her brother that she liked peanut butter very much

Once a grammar is built, you can print it. Also, you can extract a set of production rules with the .productions() method. Unlike the .productions() method called on a Tree object, the resulting list should be duplicate-free. As before, each rule in the list is a production rule type. A rule has a left-hand side node (the parent node), which you can get to using the .lhs() method; the actual string label for the node can be accessed by calling .symbol() on the node object.

>>> print(grammar3) Grammar with 5 productions (start state = S) S -> NP VP NP -> N VP -> V N -> 'Homer' V -> 'sleeps' >>> grammar3.productions() [S -> NP VP, NP -> N, VP -> V, N -> 'Homer', V -> 'sleeps'] >>> last_rule = grammar3.productions()[-1] >>> last_rule V -> 'sleeps' >>> last_rule.is_lexical() True >>> last_rule.lhs() # returns a tree node object V >>> last_rule.lhs().symbol() # returns node label as a string 'V'

Explore the rules and answer the following questions.

What is the start state of your grammar?
How many CF rules are in your grammar? Is it 71? (It should be.)
How many of them are lexical?
How many VP rules are there? That is, how many rules have 'VP' on the left-hand side of the fule? That is, how many rules are of the VP -> ... form?
How many V rules are there? That is, how many rules have 'V' on the left-hand side of the fule? That is, how many rules are of the V -> ... form?

Using grammar1, build a chart parser. (Example shown in 8.1.2 Ubiquitous ambiguity.)
Using the parser, parse the sentences s6 -- s11. If your grammar1 is built correctly to cover all of the sentences, the parser should successfully parse all of them.

PART B: Refine your Grammar [20 points]
In this part, you will be updating the grammar and the parser you built in PART A.

Examine the parser output from PART A. Is any of the sentences ambiguous, that is, has more than one parse? Pick an example and provide an explanation.
Have your parser parse this new sentence. It is covered by the grammar, therefore the parser should be able to handle it:
(s12) Lisa and her friends told Marge that Homer punched the bully in the bar

Come up with a sentence of your own that's covered by grammar1 and have the parser parse it. Are you satisfied with the result?
Let's revisit our first three sentences from Exercise 10:
(s1) Marge will make a ham sandwich
(s2) will Marge make a ham sandwich
(s3) Homer ate the donut on the table
As it is, your grammar1 does not cover them. But we can extend it with the CF rules from the three sentences' trees. Follow the steps below.

From the three sentence trees, create a list of all production rules in them. Turn it into a set, which removes all duplicates. (Hint: use set().)
From it, create a new list called more_rules, which consists of CF rules from the three trees *that are not already in grammar1*.
Add the additional rules to your grammar1's production rules, using the .extend() method.
And then, you have to re-initialize the grammar using the extended production rules (highlighted part). An illustration:

>>> print(grammar3) Grammar with 5 productions (start state = S) S -> NP VP NP -> N VP -> V N -> 'Homer' V -> 'sleeps' >>> more_rules [V -> 'sings', V -> 'drinks'] >>> grammar3.productions().extend(more_rules) >>> grammar3 = nltk.grammar.CFG(grammar3.start(), grammar3.productions()) >>> print(grammar3) Grammar with 7 productions (start state = S) S -> NP VP NP -> N VP -> V N -> 'Homer' V -> 'sleeps' V -> 'sings' V -> 'drinks' >>>

Now, rebuild your chart parser with the updated grammar1. And try parsing the three sentences. It should successfully parse them.

Try parsing another sentence of your own that's covered by the newly extended grammar1. Are you satisfied with the result?
As the final step, pickle your grammar1 as hw8_grammar.pkl.

SUBMIT:

Your complete Python script for PART A and B.
Your Python script output, i.e., a saved IDLE shell session.
Your pickled file containing grammar1.