Exercise 6: Sentiment Analysis of Movie Reviews

Try out document classification on movie reviews by following the NLTK Book's section on document classification. Since we are dealing with positive/negative opinions on movies, this task is a form of sentiment analysis. Details:
  1. Start out by exploring the movie reviews corpus and familiarizing yourself with it, a step the book skips. You know how to do this! At a bare minimum, figure out how big the corpus is, how many reviews it contains, and how many of them are positive and how many are negative.
  2. Do NOT just blindly copy and paste the code from the book: try your best to explore and understand what's going on. You will notice the code is pretty dense, with lots of list comprehensions, but see if you can break them down to understand them better. Also: don't shy away from flashing objects in the shell. You're building lots of novel objects, and without poking at them you won't understand what's going on.
  3. Because of random shuffling, your "most informative features" list might not look exactly like what's shown in the book. So, don't be alarmed if Mr. Matt Damon is missing from your list. Don't stop at 5; try 10, 20.
  4. If you're done with what's in the book, it's time to try something new. See how the classifier classifies this short and fake movie review.
     
    >>> myreview = """Mr. Matt Damon was outstanding, fantastic, excellent, wonderfully 
    subtle, superb, terrific, and memorable in his portrayal of Mulan."""   
    >>> myreview_toks = nltk.word_tokenize(myreview.lower())  # lowercase, and then tokenize
    >>> myreview_toks
    ['mr.', 'matt', 'damon', 'was', 'outstanding', ',', 'fantastic', ',', 'excellent', ',', 
    'wonderfully', 'subtle', ',', 'superb', ',', 'terrific', ',', 'and', 'memorable', 'in', 
    'his', 'portrayal', 'of', 'mulan', '.']
    >>> myreview_feats = document_features(myreview_toks)     # generate word feature dictionary
    >>> classifier.classify(myreview_feats)    # classify
                  ??              
    >>> classifier.prob_classify(myreview_feats).prob('pos')  # probability of 'pos' label
                  ??              
    >>> classifier.prob_classify(myreview_feats).prob('neg')  # probability of 'neg' label
                  ??              
    >>> 
    
  5. This time, change "Matt Damon" to "Steven Seagal" (IMDB profile) and see what happens.
  6. Now try this review, even shorter but still made up of positive words. Are you surprised by the result? (You should be.)
     
    >>> myreview = "Mr. Matt Damon was outstanding, fantastic."   
    
    Take a look at the explanation below; it will likely not make a whole lot of sense at this point, but try and keep this tidbit in your memory. We will revisit this after Homework 4.
  7. When you're done, save your IDLE shell session as a .txt file. Edit the file to clean out long and messy bits, and add your notes/comments.
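
For step 1, the bare-minimum counts could be gathered along the lines below. This is only a sketch, not the expected answer: `summarize_corpus` is a made-up helper name, and it assumes NLTK is installed and the corpus has been fetched once with `nltk.download('movie_reviews')`. The `fileids()`, `categories()`, and `words()` calls are the standard NLTK corpus reader methods.

```python
def summarize_corpus(corpus):
    """Return basic size statistics for an NLTK-style categorized corpus.

    `corpus` only needs .fileids(), .categories() and .words(),
    all of which nltk.corpus.movie_reviews provides.
    """
    per_label = {}
    for label in corpus.categories():
        # fileids(label) lists only the files carrying that label
        per_label[label] = len(corpus.fileids(label))
    return {
        "reviews": len(corpus.fileids()),   # one file per review
        "tokens": len(corpus.words()),      # word tokens, punctuation included
        "per_label": per_label,
    }


try:
    from nltk.corpus import movie_reviews
    print(summarize_corpus(movie_reviews))
except (ImportError, LookupError):
    # ImportError: NLTK is missing; LookupError: corpus not downloaded
    print("NLTK or the movie_reviews corpus is not available here.")
```

From there, poke further on your own: look at a single file ID, print the first few words of one review, and so on.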


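For step 2, one way to break down a dense list comprehension is to rewrite it as an explicit loop and flash the intermediate objects. The sketch below does this for a book-style feature-set comprehension, using a tiny toy data set so that every object is small enough to read. The toy documents and vocabulary are invented for illustration. Unlike the book's `document_features`, which takes one argument and reads a global `word_features` list, this version takes the vocabulary as a second parameter so that the example is self-contained.

```python
# Toy stand-ins (hypothetical data, not taken from the corpus).
toy_documents = [
    (["a", "fantastic", "film"], "pos"),
    (["a", "dull", "film"], "neg"),
]
toy_word_features = ["fantastic", "dull", "film", "plot"]


def document_features(document, word_features):
    """Book-style extractor: one True/False feature per vocabulary word."""
    document_words = set(document)  # set lookup is faster than list lookup
    features = {}
    for word in word_features:
        features["contains({})".format(word)] = word in document_words
    return features


# Comprehension form, in the style of the book:
featuresets = [(document_features(d, toy_word_features), c)
               for (d, c) in toy_documents]

# The same thing written out as an explicit loop:
featuresets_loop = []
for (d, c) in toy_documents:
    feats = document_features(d, toy_word_features)  # dict of word features
    featuresets_loop.append((feats, c))              # pair it with its label

print(featuresets_loop[0])
```

Notice that every vocabulary word produces a feature, including words that do not occur in the document (their value is simply False). That may be worth keeping in mind when you look at the very short review in step 6.
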
SUBMIT:
  • A saved shell session as a .txt file, edited to include your comments