LING 1330/2330 Introduction to Computational Linguistics, University of Pittsburgh


Homework Assignment 5: Regex in Python

An answer sheet (MS Word doc) is provided for this homework: HW5 Regex in Python.docx. For most questions, you will be entering two things: (1) a screenshot of your IDLE shell showing relevant code bits and output, and (2) your written answer accompanying the code. See this example to get a sense of what's expected. You'll be submitting your saved IDLE shell session too, so make sure to save it.
PART 1: Steve Jobs Redux [15 points]
In this part, we will practice Python's re module on the Steve Jobs Wikipedia article. Start by copying and pasting the first few paragraphs we used in the previous exercise, and then use the re.findall() function:

>>> import re
>>> jobs = """Steven Paul Jobs (/dʒɒbz/; February 24, 1955 – October 5, 2011) was an American business magnate, industrial designer, investor, and ... ... Jobs was diagnosed with a pancreatic neuroendocrine tumor in 2003. He died of respiratory arrest related to the tumor at age 56 on October 5, 2011."""
>>> re.findall(r'computers?', jobs, re.IGNORECASE)
['computer', 'computer', 'computers', 'computer', 'computer', 'computers', 'computer', 'computer']

Try 2 of the regular expression matches from Exercise 7, using re.findall().
If you are using () in your regular expression, be mindful about whether or not you want it to result in a group capture. Remember: you can avoid group capture by using (?:).
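The difference between the two kinds of groups is easiest to see on a toy string. The sketch below is only an illustration: the sentence and the pattern are made up and are not part of the Steve Jobs text.

```python
import re

# Toy sentence, invented for this illustration.
text = "Mr. Smith met Ms. Jones."

# Capturing group: findall() returns only what the group matched.
captured = re.findall(r'(Mr|Ms)\. \w+', text)

# Non-capturing group (?:...): findall() returns the whole match.
whole = re.findall(r'(?:Mr|Ms)\. \w+', text)

print(captured)   # ['Mr', 'Ms']
print(whole)      # ['Mr. Smith', 'Ms. Jones']
```

If your findall() result looks like a list of fragments rather than full matches, an unintended capturing group is a likely cause.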

Using re.sub(), replace all instances of capitalized words ('Jobs', 'Apple', etc.) with 'BUELLER'.
Using re.findall(), formulate a regular expression that matches all multi-word proper noun phrases ("Steve Wozniak", "The Walt Disney Company", etc.). They can be identified as a sequence of capitalized words. You may exclude unconventional capitalization patterns such as 'iPad'.
Should you exclude all-caps words such as "I" in "Apple I" and "OS" in "Mac OS"? Or include them? It's your decision.

Then, using re.sub(), replace all those matching instances with "<MULTIWORD-PNP>". You just performed a rudimentary form of a Named Entity Recognition (NER) task!
Do the same, but let's preserve the proper NP itself this time. Substitute each multi-word proper NP with itself sandwiched between the opening tag <MULTIWORD-PNP> and the closing tag </MULTIWORD-PNP>. So, for example, "Steven Paul Jobs" should be replaced with <MULTIWORD-PNP>Steven Paul Jobs</MULTIWORD-PNP>.
You'll need to use group capturing with () and then the \1 reference.
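As a reminder of how the \1 reference works inside re.sub(), here is a minimal sketch on an unrelated toy string. The <NUM> tag and the phone-number pattern are hypothetical and chosen only for illustration; your own pattern for proper noun phrases will look different.

```python
import re

# Toy string, invented for this illustration.
text = "Call 555-1234 or 555-9876."

# The parentheses capture each match; \1 in the replacement puts it back,
# wrapped in a hypothetical <NUM> tag.
tagged = re.sub(r'(\d{3}-\d{4})', r'<NUM>\1</NUM>', text)

print(tagged)   # Call <NUM>555-1234</NUM> or <NUM>555-9876</NUM>.
```

Note that the replacement is written as a raw string (r'...') so that \1 is passed to re.sub() as a group reference rather than being read as an escape character.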

PART 2: Exploring Alice in Wonderland [25 points]
In this part, we will unleash regular expressions on Carroll's Alice's Adventures in Wonderland. The text is part of NLTK's Project Gutenberg Selections corpus, and you can load up the text as a single string through the .raw() method. The resulting atxt is a very long string.

>>> from nltk.corpus import gutenberg as gut
>>> gut.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', ..., 'carroll-alice.txt', ...
>>> atxt = gut.raw('carroll-alice.txt')
>>> len(atxt)       # number of characters in atxt string
144395
>>> atxt[:200]      # initial 200 characters
"[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I. Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once"
>>> 

We now find regular expression matches on this long string. First import re and then find all alphabetic words using the .findall() method below. Note that symbols ("[", "'" and ":") and digits ("1865") did not match, rendering awords as a list of alphabetic words only:

>>> import re
>>> awords = re.findall(r'[A-Za-z]+', atxt)
>>> len(awords)     # number of alphabetic words
27335
>>> awords[:40]     # initial 40 words
['Alice', 's', 'Adventures', 'in', 'Wonderland', 'by', 'Lewis', 'Carroll', 'CHAPTER', 'I', 'Down', 'the', 'Rabbit', 'Hole', 'Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', 'and', 'of', 'having', 'nothing', 'to', 'do', 'once', 'or', 'twice', 'she', 'had']
>>> 

Now, answer the questions below through composing your own appropriate regular expressions.

Produce a list of all-capitalized words such as 'WOW'. What's the regular expression to use? How many tokens are there?
Below, we are producing all cases of 'have been', 'have ...ed' expressions with one intervening word. What are the intervening words found?

>>> haveXbeen = re.findall(r'((have|has|had|having) (\w+) (been|\w+ed))', atxt)
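When a pattern contains several capturing groups, re.findall() returns a list of tuples, one slot per group in order of its opening parenthesis. The sketch below shows this on a made-up sentence with a similarly shaped (but different) pattern; the sentence and variable names are assumptions for illustration only.

```python
import re

# Toy sentence, invented for this illustration.
text = "He was quietly reading while she is always singing."

# Four groups: the whole phrase, the verb, the middle word, the -ing word.
matches = re.findall(r'((is|was) (\w+) (\w+ing))', text)
print(matches)
# [('was quietly reading', 'was', 'quietly', 'reading'),
#  ('is always singing', 'is', 'always', 'singing')]

# Index 2 of each tuple is the third group, i.e. the middle word.
middle = [m[2] for m in matches]
print(middle)   # ['quietly', 'always']
```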

Produce a list of 'so ...ly' phrases, that is, 'so' modifying a word ending in 'ly'. How many are there?
Produce a list of 'so ...ly' phrases, allowing up to one intervening word between 'so' and the '...ly' word. How many are there?
Again, be mindful about group capture when you are using "()" in your regular expression. Use (?:) instead if you want to avoid group capture.

Of the word tokens in awords, how many have 3 or more "vowel" characters in a row? "beautiful" would be an example. To find out, use a list comprehension over awords together with re.search().
The '50 State names' example at the bottom of this page should show you how.
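The general shape of such a filter is sketched below on a small made-up word list, using a different pattern (a doubled letter) so as not to give the answer away. The list and the pattern are illustrative assumptions.

```python
import re

# Toy word list, invented for this illustration.
words = ['letter', 'cat', 'book', 'tree', 'dog']

# re.search() returns a match object (truthy) or None (falsy),
# so it can serve directly as the filter condition.
doubled = [w for w in words if re.search(r'(.)\1', w)]

print(doubled)        # ['letter', 'book', 'tree']
print(len(doubled))   # 3
```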

Conduct your own regular expression search. Explain what you wanted to find, what your regular expression is, and what the search turned up.
Next up, let's try re.search(). This method is best suited to sifting through multiple strings, often hundreds or thousands of lines. We will first break up atxt into a list of sentence strings and replace the newline characters ('\n') with a space; then we are ready to search through them. With re.search(), the same regular expression is typically applied to many strings, so compiling it once with re.compile() is good practice. Below, we are searching for sentences containing long (13+ characters) words.

>>> import nltk
>>> asents = nltk.sent_tokenize(atxt)
>>> len(asents)     # number of sentences in Alice
1625
>>> asents[14]      # 15th sentence, has a line break in the middle
"How brave they'll all think me at\nhome!"
>>> asents[14].replace('\n', ' ')     # replace with a space
"How brave they'll all think me at home!"
>>> asents = [s.replace('\n', ' ') for s in asents]     # transform all sentences
>>> asents[14]      # sentences are a single line now
"How brave they'll all think me at home!"
>>> 
>>> myre = re.compile(r'\w{13,}')     # 13+ char words
>>> for s in asents:
        matchobj = myre.search(s)
        if matchobj:     # considered True as long as matchobj is not None
            print(matchobj.group()+'\n'+s+'\n')

Multiplication
However, the Multiplication Table doesn't signify: let's try Geography.

inquisitively
The Mouse looked at her rather inquisitively, and seemed to her to wink with one of its little eyes, but it said nothing.

...
>>> matched = [s for s in asents if myre.search(s)]     # save matched sentences

We decide to find sentences with an all-caps quote, such as "The sign had 'DRINK ME' beautifully printed on it.". A regular expression like r"'([A-Z]+ )*[A-Z]+'" just might do the trick. Try it out, and answer the following questions.

How is the regex's precision? Meaning, are the captured sentences what we wanted to capture? Try your own modification aimed at improving precision, and report how it worked.
How is the regex's recall? Meaning, are many/few sentences we wanted to capture slipping through? How would you go about answering this question?
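If you hand-label a sample of sentences, precision and recall reduce to simple set arithmetic. The sketch below assumes hypothetical sentence indices; neither set comes from the Alice text.

```python
# Hypothetical sentence indices, invented for this illustration.
retrieved = {1, 2, 3, 4}       # sentences the regex matched
relevant = {2, 3, 4, 5, 6}     # sentences a human judged as wanted

true_positives = retrieved & relevant     # matched AND wanted

# Precision: what fraction of the matches were wanted?
precision = len(true_positives) / len(retrieved)
# Recall: what fraction of the wanted sentences were matched?
recall = len(true_positives) / len(relevant)

print(precision)   # 0.75
print(recall)      # 0.6
```

Precision can be judged by inspecting the regex's output alone, whereas recall requires knowing which sentences should have matched, which is why it is the harder of the two to estimate.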

Try and find all WH questions: questions that start with a wh- word (who, what, where, etc.).

SUBMIT:

MS Word answer sheet: "HW5 Regex in Python.docx" (template linked on top)
Your saved IDLE shell (you can edit to trim errors, etc.): "HW5_shell.txt"
