Try out document classification on movie reviews by following this NLTK Book section. Since we are dealing with positive/negative opinions on movies, this task is a form of sentiment analysis. Details:
Start out by exploring the movie reviews corpus and familiarizing yourself with it, something the book doesn't do. You know how to do this! Figuring out how big the corpus is, how many reviews there are, and how many of them are positive/negative would be a bare minimum.
Use the usual corpus methods: .fileids(), .words(), .raw(). This particular corpus comes with categories, so you should also try: .categories(). You can list file IDs based on categories: movie_reviews.fileids('pos').
Do NOT just blindly copy and paste the code from the book: try your best to explore and understand what's going on. You will notice the code is pretty dense, with lots of list comprehensions, but see if you can break them down to achieve better understanding. Also: don't shy away from flashing objects in the shell. You're building lots of novel objects, and without poking at them you won't understand what's going on.
That nested list comprehension for building up documents is a head-scratcher. Let me unpack it for you: it for-loops through the list of categories (a short one, just ['neg', 'pos']), then for-loops through all file IDs belonging to each category, and finally creates a tuple of (review tokens, category), which populates the documents list. Go ahead and flash documents[0] in the IDLE shell. (If you get a 'Squeezed text...' message, double-clicking it will reveal the content.) You will see that the tuple is a pair (x, y) where x is a movie review in its tokenized form and y is its category, 'pos'/'neg'. This data object makes a review (represented as word tokens) and its label more easily accessible for the upcoming feature generation step.
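If the one-liner still won't click, here is the same construction rewritten as explicit for-loops. This sketch uses tiny made-up stand-in data so you can run it anywhere; in the real code, the categories and tokens come from movie_reviews.categories(), movie_reviews.fileids(category), and movie_reviews.words(fileid):

```python
# Tiny stand-in data, just to show the loop structure (NOT the real corpus)
categories = ['neg', 'pos']
fileids_by_cat = {'neg': ['neg/cv000.txt'], 'pos': ['pos/cv000.txt']}
words_by_file = {'neg/cv000.txt': ['plot', 'was', 'awful'],
                 'pos/cv000.txt': ['truly', 'wonderful', 'film']}

# The book's nested comprehension, unrolled into explicit loops:
documents = []
for category in categories:                  # outer loop: ['neg', 'pos']
    for fileid in fileids_by_cat[category]:  # inner loop: files in that category
        documents.append((words_by_file[fileid], category))  # (tokens, label) tuple

print(documents[0])   # (['plot', 'was', 'awful'], 'neg')
```

Once you see the two loops, the original one-liner is just this same code folded into a single expression.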
Because of random shuffling, your "most informative features" list might not look exactly like what's shown in the book. So, don't be alarmed if Mr. Matt Damon is missing from your list. Don't stop at 5; try 10, 20.
Don't forget to import random!
If you're done with what's in the book, it's time to try something new. See how the classifier classifies this short and fake movie review.
>>> myreview = """Mr. Matt Damon was outstanding, fantastic, excellent, wonderfully
subtle, superb, terrific, and memorable in his portrayal of Mulan."""
>>> myreview_toks = nltk.word_tokenize(myreview.lower()) # lowercase, and then tokenize
>>> myreview_toks
['mr.', 'matt', 'damon', 'was', 'outstanding', ',', 'fantastic', ',', 'excellent', ',',
'wonderfully', 'subtle', ',', 'superb', ',', 'terrific', ',', 'and', 'memorable', 'in',
'his', 'portrayal', 'of', 'mulan', '.']
>>> myreview_feats = document_features(myreview_toks) # generate word feature dictionary
>>> classifier.classify(myreview_feats) # classify
??
>>> classifier.prob_classify(myreview_feats).prob('pos') # probability of 'pos' label
??
>>> classifier.prob_classify(myreview_feats).prob('neg') # probability of 'neg' label
??
>>>
I know you're curious about myreview_feats, but printing/flashing it in its entirety is a bad idea. It is a large dictionary with 2,000 dimensions, because the feature generator function creates a True/False entry for each of the top 2,000 words, no matter the size of the input movie review text. So, instead, you should list-ify myreview_feats.items() and then print out slices.
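Here is a sketch of that slicing idea, using a hypothetical miniature feature dictionary standing in for the real 2,000-entry myreview_feats:

```python
# Hypothetical miniature stand-in for the 2,000-entry myreview_feats dictionary
myreview_feats = {'contains(matt)': True, 'contains(damon)': True,
                  'contains(enjoyed)': False, 'contains(plot)': False}

feat_items = list(myreview_feats.items())   # list-ify so it can be sliced
print(feat_items[:2])                       # peek at the first two entries only
print(len(feat_items))                      # confirm the dimension count
```

With the real dictionary, slices like feat_items[:20] or feat_items[1000:1010] let you spot-check different regions without flooding the shell.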
This time, change "Matt Damon" to "Steven Seagal" (IMDB profile) and see what happens.
Now try this review, even shorter but still made up of positive words. Are you surprised by the result? (You should be.)
>>> myreview = "Mr. Matt Damon was outstanding, fantastic."
Take a look at the explanation below; it will likely not make a whole lot of sense at this point, but try and keep this tidbit in your memory. We will revisit this after Homework 4.
This surprising result comes from the fact that under this particular classifier model, all reviews, long or short, get represented by exactly the same set of 2,000 presence/absence word features. Even though this short review has only 8 word tokens, there are (at least) 1,992 other features simultaneously voting on the 'pos' and 'neg' labels. In this case, these "absent" word features voted heavily towards 'neg' (e.g., enjoyed was not found, which pushed the prediction towards 'neg'); the presence of Matt, Damon, outstanding, fantastic -- all strong features towards 'pos' -- didn't have enough collective sway.
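One way to see the imbalance, sketched here with a hypothetical stand-in for the book's top-2,000 word_features list, is to simply count how many features come out True versus False for the short review:

```python
# Tokens of the short review (as produced earlier by nltk.word_tokenize)
review_toks = ['mr.', 'matt', 'damon', 'was', 'outstanding', ',', 'fantastic', '.']

# Hypothetical stand-in for the book's top-2,000 word_features list
word_features = ['matt', 'damon', 'outstanding', 'enjoyed', 'boring',
                 'plot', 'waste', 'superb']

review_set = set(review_toks)
feats = {f'contains({w})': (w in review_set) for w in word_features}

present = sum(feats.values())            # features that fire as True
absent = len(word_features) - present    # "silent" features, still voting
print(present, absent)                   # 3 present, 5 absent
```

With the real 2,000-word list, the absent features outnumber the present ones by roughly 250 to 1 for this review, which is why their collective vote can drown out a handful of strongly positive words.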
When you're done, save your IDLE shell session as a .txt file. Edit the file to clean out long and messy bits, and add your notes/comments.
SUBMIT:
A saved shell session as a .txt file, edited to include your comments