Homework Assignment 2: A Duel of Bigrams

This homework assignment contains two parts.
We will continue to explore the two corpora from Homework 1: the Bible and the Jane Austen novels, both part of NLTK's Project Gutenberg Selections corpus. Again, the files are:
  • The King James Version Bible (bible-kjv.txt)
  • Jane Austen novels (austen-emma.txt, austen-persuasion.txt, austen-sense.txt)

PART 1: Bigram Frequencies of the Bible and Jane Austen Novels [35 points]


In PART 1, we will take a close look at the bigram frequencies of the two corpora. We are interested in what types of word bigrams are frequently found in each corpus, and also what words are found following the word 'so', and with what probability. Additionally, we will pickle the bigram frequency dictionaries so we can re-use them later. To achieve these goals, complete this TEMPLATE script, which:
  1. imports necessary modules,
  2. opens the text files for the two corpora and reads in the content as text strings,
  3. builds the following objects, b_ for the Bible and a_ for Austen:
    1. b_toks, a_toks: word tokens, all in lowercase
    2. b_tokfd, a_tokfd: word frequency distribution
    3. b_bigrams, a_bigrams: word bigrams, cast as a list
    4. b_bigramfd, a_bigramfd: bigram frequency distribution
    5. b_bigramcfd, a_bigramcfd: bigram (w1, w2) conditional frequency distribution ("CFD"), where w1 is construed as the condition and w2 the outcome
  4. pickles the two bigram CFDs (conditional frequency distributions) using the highest binary protocol: name the files bible_bigramcfd.pkl and austen_bigramcfd.pkl.
  5. answers the following questions by exploring the objects:
    1. How many word tokens and types are there, for each corpus?
      Compare the overall size of the two corpora. Which one is larger?
    2. What are the top 20 most frequent words and their counts, for each corpus?
      Make a comparison. Anything noteworthy?
    3. What are the top 20 most frequent word bigrams and their counts, for each corpus?
      Make a comparison between the two corpora. What observations can you make?
    4. How many times does the word 'so' occur in each corpus? What is its relative frequency in each corpus, measured against the corpus size (= total # of tokens)?
      Judging by the relative frequency, in which corpus is 'so' more frequently found, and by how much?
    5. In each corpus, what are the top 20 'so-initial' bigrams (bigrams that have the word so as the first word) and their counts?
      Do a cross-comparison. What observations can you make? Is the Bible's use of 'so' similar to Austen's?
    6. In the Bible, given the word 'so' as the current word, what is the probability of getting 'much' as the next word? How about in the Jane Austen novels? How about 'will': how does it fare as the next word? Provide a cross-comparison summary.
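
The objects in step 3, the pickling in step 4, and the arithmetic behind questions 4 and 6 can be sketched on a toy sentence. This is only an illustration under stated assumptions, not the TEMPLATE script: the sentence, the whitespace tokenization, and the file name toy_bigramcfd.pkl are made up for the example, and the standard library's Counter and defaultdict stand in for NLTK's nltk.FreqDist and nltk.ConditionalFreqDist, which provide the same kind of counts.

```python
import os
import pickle
import tempfile
from collections import Counter, defaultdict

# Toy text standing in for a corpus (hypothetical; not one of the assignment files)
text = "So much to do and so much to see so will we go"

toks = text.lower().split()           # crude whitespace tokenization, lowercased
tokfd = Counter(toks)                 # word frequency distribution
bigrams = list(zip(toks, toks[1:]))   # word bigrams, cast as a list
bigramfd = Counter(bigrams)           # bigram frequency distribution

# Conditional frequency distribution: w1 is the condition, w2 the outcome
bigramcfd = defaultdict(Counter)
for w1, w2 in bigrams:
    bigramcfd[w1][w2] += 1

# Relative frequency of 'so' = count of 'so' / total number of tokens
so_count = tokfd["so"]
so_relfreq = so_count / len(toks)

# P(next word = 'much' | current word = 'so')
#   = count of ('so', 'much') / count of all bigrams starting with 'so'
so_total = sum(bigramcfd["so"].values())
p_much = bigramcfd["so"]["much"] / so_total
p_will = bigramcfd["so"]["will"] / so_total

# Pickle with the highest binary protocol, then load it back to check
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy_bigramcfd.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(bigramcfd, f, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(len(toks), "tokens,", len(tokfd), "types")
print("top so-initial bigrams:", bigramcfd["so"].most_common(20))
print("P(much | so) =", p_much, " P(will | so) =", p_will)
```

The relative frequency is the quantity to compare across corpora, because raw counts depend on corpus size.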


PART 2: Bigram Speak [15 points]

Next, let's use the pickled data for some fun. We will plug it into a program called "Bigram Speak". Instructions:

  1. Download BigramSpeak.py.
  2. The program won't run as it is. You need to modify it first by doing the following:
    1. Plug in one of your bigram CFDs (conditional frequency distributions) by unpickling and loading one of your pickled CFDs, assigning it to the variable w1w2f.
    2. Choose the appropriate title for your session by uncommenting one of the provided value assignments for title.
  3. Try out the program. Make sure to try the word 'so', and also the ENTER key. Try out a few different runs to get a sense of how the program works.
  4. Now do the same with the other corpus data.
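
The general idea of picking a next word from a bigram CFD can be sketched as follows. This is not the code of BigramSpeak.py, whose internals are not described here; it is a minimal illustration that assumes a CFD shaped like {w1: {w2: count}}. The toy counts and the helper names next_word and babble are hypothetical.

```python
import random
from collections import Counter

# Hypothetical toy CFD; in the assignment, w1w2f would come from an unpickled file
w1w2f = {
    "so": Counter({"much": 2, "will": 1}),
    "much": Counter({"to": 2}),
    "to": Counter({"do": 1, "see": 1}),
    "will": Counter({"we": 1}),
}

def next_word(cfd, current, rng):
    """Pick a follower of `current`, weighted by bigram count; None if unseen."""
    followers = cfd.get(current)
    if not followers:
        return None
    words = list(followers)
    counts = [followers[w] for w in words]
    return rng.choices(words, weights=counts, k=1)[0]

def babble(cfd, start, max_len=8, seed=0):
    """Chain next_word calls from `start` until a dead end or max_len words."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < max_len:
        nxt = next_word(cfd, out[-1], rng)
        if nxt is None:
            break
        out.append(nxt)
    return out

print(" ".join(babble(w1w2f, "so")))
```

Weighting the choice by count means frequent followers such as 'much' after 'so' show up more often than rare ones, which is what makes the output resemble its source corpus.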

Save your shell session as a text file (.txt extension). Open it in a text editor, and at the end of the file add your answers to the following questions:

  1. Examine the code closely to understand how it works. What does the interactive portion of the program do? Describe in your own words what it does, step by step.
  2. Copy and paste a passage that the script produced, mostly through random selection, which you found particularly Bible-like. What strikes you?
  3. Do the same with Jane Austen.
  4. Provide a summary of your overall assessment of this program. Include your impressions, observations, and comparisons between Bible Speak and Jane Austen Speak, plus anything else that strikes you.

SUBMIT:
  • PART 1: The completed bible_austen_bigrams.py script, and its shell-side output saved as a text file bible_austen_bigrams.OUT.txt. You don't need to submit the pickle files.
  • PART 2: A saved shell session as a text file containing your answers and summary write-up at the end.