
Homework Assignment 4: Who Said It?

This homework assignment has two parts. It is a larger-than-usual, week-long assignment that carries a total score of 60.

Who Said It?

Can you guess which author wrote the following lines: Jane Austen or Herman Melville?
I never met with a disposition more truly amiable.
But Queequeg, do you see, was a creature in the transition stage -- neither caterpillar nor butterfly.
I'll bet you can, but can a computer algorithm?

Let's find out. We will build a classifier which, given a sentence as input, predicts which of the two authors wrote it. We will be using Austen's Emma and Melville's Moby Dick as our corpus data, included in NLTK's Gutenberg corpus. There are two parts in this experiment: [A] developing a classifier, and [B] analysis and write-up.

Before you begin

For this homework, we will be using a couple of classifier methods that do not come with NLTK's original implementation of Naive Bayes. You will therefore need to update your NLTK's NaiveBayesClassifier module. Follow the instructions below.
  1. Find your local copy of naivebayes.py, found in the classify directory in your NLTK directory. Rename it naivebayes.py.ORIGINAL just in case you need it later. (NOTE: You may run into permission issues. Mac users: Follow the instructions in this FAQ.)
  2. Download this new naivebayes.py file: naivebayes.py and copy it into that same directory. Supply an administrator password if necessary.
  3. Quit and restart Python, then import NLTK.
  4. Verify that the update was successful by checking if these two new commands are recognized:
     
    >>> import nltk
    >>> help(nltk.classify.NaiveBayesClassifier.feature_weights)
    Help on function feature_weights in module nltk.classify.naivebayes:
    
    feature_weights(self, fname, fval)
        Displays per label the probability of the feature+value.
        Added By Na-Rae Han, October 2023
    
    >>> help(nltk.classify.NaiveBayesClassifier.show_most_informative_feats_all)
    Help on function show_most_informative_feats_all in module nltk.classify.naivebayes:
    
    show_most_informative_feats_all(self, n=10)
        Displays top n most informative features. 
        Unlike show_most_informative_features, includes odds ratio of
    

[PART A] Developing a Naive Bayes Classifier

In this part, you will develop a Naive Bayes classifier. Start with this template script: hw4_whosaid.TEMPLATE.py. Before you begin editing, execute the script and see what it prints out. Your job is to modify the script so it proceeds through the following steps (a rough sketch of a few of these steps follows the list):
  1. Load each text as a list of sentences (already COMPLETE in template script)
  2. From the two author-segregated lists, build two new lists where each sentence is labeled with its author, while weeding out sentences that are too short (1-2 words in length, e.g., 'Chapter XIII') to be useful
  3. Join the two labeled author-segregated lists into a single list
  4. Print out how many Austen, Melville, and total sentences there are (COMPLETE)
  5. Shuffle the labeled sentence list you made in STEP 3
  6. Partition sentences into three data sets: testing (first 1,000 sentences), development-testing (next 1,000), and training (the rest); and then, print out the size of each set.
  7. Define a feature-generator function (COMPLETE)
  8. Using the feature-generator function, convert the three data sets into their corresponding feature representations
  9. Using the training set, construct a Naive Bayes classifier named whosaid
  10. Using the test set, evaluate the classifier's performance
  11. From the development test set, create 4 subsets based on the real author vs. the classifier's prediction
  12. Print out sample correct and incorrect predictions from the 4 sub-divided sets
  13. Finally, print out the 40 most informative features, using two methods: first the usual .show_most_informative_features(), and then the new .show_most_informative_feats_all().
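
To give you an idea of the general shape, here is a minimal sketch of STEPS 2, 3, and 6. Except for sents, the variable names (emma_sents, moby_sents, etc.) and the (label, sentence) tuple order are illustrative assumptions, not necessarily the template's -- adapt them to your script.

    # Rough sketch only -- adjust names and details to match the template.
    # STEP 2: label each sentence with its author, dropping very short (1-2 token) sentences.
    emma_labeled = [('austen', s) for s in emma_sents if len(s) > 2]
    moby_labeled = [('melville', s) for s in moby_sents if len(s) > 2]

    # STEP 3: join the two labeled lists into a single list.
    sents = emma_labeled + moby_labeled

    # STEP 6: partition into test / dev-test / training sets (after shuffling in STEP 5).
    test_sents = sents[:1000]
    devtest_sents = sents[1000:2000]
    train_sents = sents[2000:]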

That concludes your classifier development. Note that every time you run your script you will get slightly different performance scores: that's because you are randomly shuffling your data set every time, resulting in a different "test -- development test -- training set" partition.

That brings us to one last step I want you to take. Replace your sentence-shuffling command from STEP 5 with the following line:

random.Random(10).shuffle(sents)
and re-run the script. This shuffles the sentence list using a fixed random seed. The result is a list of sentences that has been randomly shuffled but is nevertheless *in the same sequential order for all of us*, which in turn leads to identical classifier models for everyone! This effectively freezes your/our model and lets us share the same reference points for the next part of the homework, which centers on analysis. (And, obviously, it makes grading feasible in the first place...)


[PART B] Analysis and Write-Up

In this part, you will explore the classifier you built in PART A in order to gain an understanding of its inner workings. Instructions:
  • You should primarily work in a Python shell environment immediately following an execution of your script.
  • Answer the following questions in this MS Word document: HW4 Who Said It - PART B Analysis.docx. For most questions, you will be entering two things: (1) a screenshot of your IDLE shell showing relevant code bits and output, and (2) your written analysis accompanying the code. Don't just point to the code output and skip your commentary! YOUR ANALYSIS is the focus here, not the numbers or calculations. See this example to get a sense of what's expected.
  • As usual, when you're done save your IDLE shell session as a text file: hw4_shell.txt. You'll be submitting it as well. Feel free to post-edit to remove long, unnecessary bits.
Let's get to it!
  1. Classifier accuracy
    What is the system's accuracy? Is it lower or higher than you expected?
  2. Features
    1. Examine the gen_feats() function. What sort of features are used in this classifier model?
    2. Examine the first most informative features list and make observations. Do you notice any patterns? Any surprising entries?
    3. Examine the second most informative features list. How is this list different from the first one? Any additional observations? (NOTE: don't get confused! You are still dealing with the SAME model, it's just that the two methods show a slightly different set of top features.) (NOTE 2: If you are unclear about this one, revisit after you've worked on Q10 below.)
  3. Main character names
    Some of you are probably thinking: the classifier must be getting a lot of help from the main character names such as Emma, Ahab and Queequeg. Let's see how well it does without them.
    1. The script already contains a switch that you can turn on to "neutralize" the top 35 most common character names and place names in the two novels by turning them all into 'MontyPython'. Edit the file and set the value of noCharNames to True.
    2. Re-run the script. How is the new classifier's performance? Did it degrade as much as you expected? Why do you think that is? How is the top feature list affected?
    3. When you're done, set noCharNames back to False and re-build your classifier by running the script again. For the rest of this homework, USE THIS ORIGINAL SETTING.
  4. Trying out sentences
    Test the classifier on the two sentences below. Sent1 is actually by Jane Austen, taken from Persuasion. Sent2 is from Alice's Adventures in Wonderland by Lewis Carroll.
    (Sent1) Anne was to leave them on the morrow, an event which they all dreaded.
    (Sent2) So Alice began telling them her adventures from the time when she first saw the White Rabbit.
    1. What label did the classifier give to Sent1 and Sent2? Did it match your expectation?
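    If you are unsure how to feed a raw sentence to the classifier, the lines below are one possible route. They assume gen_feats() takes a list of word tokens (as the Gutenberg sentences are stored); nltk.word_tokenize() is used here only as a stand-in tokenizer, so adjust to however your template handles tokenization.

    >>> sent1 = "Anne was to leave them on the morrow, an event which they all dreaded."
    >>> sent1_feats = gen_feats(nltk.word_tokenize(sent1))   # assumes gen_feats() accepts a token list
    >>> whosaid.classify(sent1_feats)                        # returns 'austen' or 'melville'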
  5. Label probabilities for a sentence
    Labeling judgments aside, how likely does your model think it is that Sent1 is Austen? That is essentially P(austen|Sent1). To find out, we need to use the .prob_classify method instead of the usual .classify. The example below demonstrates how to find the probability estimates assigned to each label for the sentence 'Hello, world'. whosaid thinks it's 72% Melville and 28% Austen:
     
    >>> hellofeats
    {'contains-hello': 1, 'contains-,': 1, 'contains-world': 1}
    >>> whosaid.prob_classify(hellofeats).prob('austen')
    0.27993161657937693
    >>> whosaid.prob_classify(hellofeats).prob('melville')
    0.720068383420623
    
    1. Try it with Sent1. What is P(austen|Sent1)? That is, given Sent1, how likely is it to be Austen? What is P(melville|Sent1)?
    2. How about Sent2 -- P(austen|Sent2) and P(melville|Sent2)?
    3. From 5.1 and 5.2, how "confident" is your classifier about Sent1 being Austen? Is it equally confident about Sent2 being Melville?
  6. Trying out made-up sentences
    Now, test the classifier on the following made-up sentences:
    (Sent3) He knows the truth
    (Sent4) She knows the truth
    (Sent5) blahblahblah blahblah
    1. What labels did the classifier give to Sent3 and Sent4, and with what probabilities? Any thoughts?
    2. What about Sent5? Given that neither "word" appeared in the training data, why do you think the classifier made the prediction it did?
  7. Base probabilities (=priors)
    Not knowing anything about a sentence, is it more likely to be Austen or Melville? We can answer this question by establishing the base probabilities, i.e. priors. In your training data (i.e., train_sents or train_feats):
    1. How many sentences are there?
    2. How many of them are Austen?
    3. How many of them are Melville?
    4. From the above, what are P(austen) and P(melville)?
    5. How is your answer to 7.4 related to the classifier's prediction on Sent5 above?
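    A possible starting point in the shell, assuming each entry of train_sents is a (label, tokens) pair as in the earlier sketch (flip the indexing if your template stores them the other way around):

    >>> len(train_sents)
    >>> austen_train = [s for s in train_sents if s[0] == 'austen']      # assumes the label comes first
    >>> melville_train = [s for s in train_sents if s[0] == 'melville']
    >>> len(austen_train), len(melville_train)
    >>> p_austen = len(austen_train) / len(train_sents)
    >>> p_melville = len(melville_train) / len(train_sents)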
  8. Calculating odds ratio
    Would the word 'very' be more indicative of Austen or Melville, and how strongly so? Let's answer this by calculating its odds ratio. Find out the following, again in the training data:
    1. How many Austen sentences contain 'very'? Make sure to count 'Very' as well.
    2. How about Melville sentences?
    3. What is P(very|austen)? That is, given an Austen sentence, how likely is it to contain 'very'?
    4. What is P(very|melville)? That is, given a Melville sentence, how likely is it to contain 'very'?
    5. What is the Austen-to-Melville odds ratio of 'very'?
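    Continuing the sketch from question 7 (same caveat about the assumed (label, tokens) structure), the counts and the ratio might be gathered like this:

    >>> austen_very = [s for s in austen_train if 'very' in s[1] or 'Very' in s[1]]
    >>> melville_very = [s for s in melville_train if 'very' in s[1] or 'Very' in s[1]]
    >>> len(austen_very), len(melville_very)
    >>> p_very_austen = len(austen_very) / len(austen_train)
    >>> p_very_melville = len(melville_very) / len(melville_train)
    >>> p_very_austen / p_very_melville     # Austen-to-Melville odds ratio of 'very'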
  9. Feature weights in model
    P(very|austen) and P(very|melville) are indeed the 'weights' your model assigns to the feature 'contains-very':1. Let's confirm this by probing your model. Use the .feature_weights() method:
     
    >>> whosaid.feature_weights('contains-she', 1)
    {'austen': 0.2079274689045407, 'melville': 0.011496285815351963}
    
    This means that roughly 1 out of 5 Austen sentences contains 'she'. For Melville, the likelihood is much lower: about 1 out of 100 sentences. (For fun, try it with 'he'.)
    1. What are the weights of 'very'?
    2. Do they match up with what you calculated in 8.3 and 8.4 above? (They should. The small differences are effects of smoothing, which may be more pronounced in other cases.)
  10. Zero counts and feature weights
    In order to accommodate features and feature-value pairs never encountered in the training data, a machine learning algorithm will adopt a couple of strategies, including smoothing.
    1. Look up the feature weights of 'contains-whale' and also 'contains-ahab'. What do you notice?
    2. This time, look up the feature weights for words 'housekeeper' and 'Emma'. Anything noticeable?
    3. Find a word that occurs in Austen's work only, and another that occurs only in Melville, and then look up their feature weights. You should have a theory by now -- sum up what is going on with these words and their feature weights.
    4. As a comparison point, the word 'enchanting' occurs exactly once in the Austen training sentences, and likewise in the Melville training sentences. How do its feature weights compare against the ones you saw above?
    5. Now, try 'contains-internet'. What happens this time?
    6. Next, using the .prob_classify method, find out the likelihood of 'She hates the internet' being an Austen sentence. Then try 'She hates the'. What can you conclude about the classifier's handling of features it never encountered in the training data?
  11. Combining feature weights
    (**NOTE: Do not use rounding for the calculations below! Numbers will fail to match up if you do.)
    So how is the overall probability estimate that a given sentence is, say, Austen obtained from the individual feature weights? That is, how is P(austen|sent) obtained? Recall from class that:
        P(label|sent) = P(sent,label) / P(sent)
    Let's focus on the label being Austen for now:
        P(austen|sent) = P(sent,austen) / P(sent)    
    Note that P(sent, austen) + P(sent, melville) equals the probability of the sentence occurring at all, i.e., P(sent). Therefore:
        P(austen|sent) = P(sent,austen) / P(sent)
                       = P(sent,austen) / ( P(sent,austen) + P(sent,melville) )
    So this means we need to know P(sent,austen) and P(sent,melville). By the definition of conditional probability, P(sent|austen) = P(sent,austen) / P(austen). (Note that P(sent,austen) = P(austen,sent).) Therefore, we can estimate P(sent,austen) as follows:
        P(sent,austen) = P(austen) * P(sent|austen)
                       = P(austen) * P(w1|austen) * P(w2|austen) * ... * P(wn|austen)
    That is, the probability of a sentence occurring and it being Austen equals the base probability of the label P(austen) multiplied by the probability of each word feature in the sentence occurring in an Austen sentence. Then, the probability of Sent3 'He knows the truth' occurring as an Austen sentence is:
        P(Sent3,austen) 
              = P(austen) * P(he|austen) * P(knows|austen) * P(the|austen) * P(truth|austen) 
    1. You already calculated the Austen prior P(austen) in 7.4. What is it?
    2. P(he|austen) can be found through whosaid.feature_weights('contains-he', 1). Likewise with the rest of the words.
    3. From 11.1 and 11.2, calculate P(Sent3, austen).
    4. Similarly, calculate P(Sent3, melville).
    5. Now, calculate P(Sent3) as P(Sent3, austen) + P(Sent3, melville).
    6. Ultimately, the probability question we would like to answer is: "Given the sentence He knows the truth, how likely is it to be Austen?" That is, what is P(austen|Sent3)? Use the formula above to calculate this.
    7. Does the figure match up with the classifier's estimation from 6.1 above? (It should.)
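    One way to organize the arithmetic for 11.3 through 11.6 in the shell is sketched below. It assumes .feature_weights() returns the dictionary it displays (if it only prints, copy the numbers into variables by hand), and it reuses p_austen and p_melville from question 7:

    >>> w_he = whosaid.feature_weights('contains-he', 1)
    >>> w_knows = whosaid.feature_weights('contains-knows', 1)
    >>> w_the = whosaid.feature_weights('contains-the', 1)
    >>> w_truth = whosaid.feature_weights('contains-truth', 1)
    >>> sent3_austen = p_austen * w_he['austen'] * w_knows['austen'] * w_the['austen'] * w_truth['austen']
    >>> sent3_melville = p_melville * w_he['melville'] * w_knows['melville'] * w_the['melville'] * w_truth['melville']
    >>> sent3_austen / (sent3_austen + sent3_melville)      # P(austen|Sent3)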
  12. Performance on the development-test data
    Work with the four lists aa, mm, am, ma to answer the following questions.
    1. Of the 1,000 sentences in the development-test set, how many did the classifier label correctly?
    2. What is whosaid's accuracy on the development test data? Ideally, it should be close to its performance on the test data -- is it?
    3. What % of the sentences did the classifier label as 'austen'? How about 'melville'? Why do you think it is not 50-50?
    4. What % of the classifier's 'austen' rulings are correct? That is, when the classifier labels a sentence as 'austen', how likely is that prediction to be correct?
    5. Likewise, when the classifier labels a sentence as 'melville', what is the likelihood of this prediction being correct?
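    A starting point for these counts, assuming the list names follow the pattern described in question 13 (first letter = real author, second letter = predicted label):

    >>> len(aa), len(mm), len(am), len(ma)      # two correct groups, two error groups
    >>> (len(aa) + len(mm)) / 1000              # accuracy on the development-test set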
  13. Error analysis
    The list am contains all sentences from the dev-test set that are in fact Austen but were mis-labeled as Melville by whosaid. Let's take a look at these errors.
    1. Print out all mis-classified Austen sentences by:
       
      >>> for x in am: print(' '.join(x[2]))
      
      What do you think of these sentences? Do they sound Melville-like to you?
    2. When a classifier mislabels, we hope that it at least did so with low confidence (say, 55%) rather than high (98%). Pick some sentences from the list and see what likelihood whosaid assigned to them for being Melville. Of all the sentences you tried, which was judged Melville with the lowest confidence? Which one was it most sure about being Melville?
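    A minimal way to check one sentence's assigned probability, assuming the tokens sit at index 2 of each entry (as in the print loop above) and that gen_feats() accepts a token list:

      >>> badfeats = gen_feats(am[0][2])                     # features for one mis-labeled Austen sentence
      >>> whosaid.prob_classify(badfeats).prob('melville')   # how confidently was it judged Melville?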


This is a big assignment with 60 total points (Part A: 20, Part B: 40), so you have a full week to finish. But to help you NOT procrastinate, I am building in a checkpoint. The homework submission link is configured to accept multiple submissions, so use the same link for both the checkpoint and the final submission.

10/5 (Thu) Check your PART A output
  • In the text box, report your classifier's accuracy (from Part A, STEP 10) after you have fixed the random seed. Don't round the number! After your submission, you will be able to see whether your figure is correct. If it is incorrect, there is something wrong with your script. Rework your script and submit a new answer.
  • That's it! The point was to make sure you are on the right track. No need to submit your script at this time.
10/10 (Tue) SUBMIT completed homework

For PART A, Upload your finished script: "hw4_whosaid.py"

For PART B, Upload:

  • Analysis answer sheet: "HW4 Who Said It - PART B Analysis.docx"
  • Your saved IDLE shell (you can edit to trim errors, etc.): "hw4_shell.txt"
Advanced Pythonians: You may submit a single Jupyter Notebook file (.ipynb) instead of all three. See the "ADVANCED PYTHONIANS" box on top for a template.