
Homework Assignment 3: Bulgarian vs. Japanese EFL Writing

Comparative Analysis of Two Learner Corpora
Between Bulgarian and Japanese college students, which group writes English at a more advanced level? Let's explore the question with real data: 60 English essays written by Japanese and Bulgarian students, excerpted from the ICLE2 (International Corpus of Learner English v2) corpus. Among the many measurements that can serve as an indicator of writing quality, we will try our hand at these metrics:
  1. Syntactic complexity: Are the sentences simple and short, or are they complex and long?
    → Can be measured through average sentence length
  2. Lexical diversity: Are there more of the same words repeated throughout, or do the essays feature more diverse vocabulary?
    → Can be measured through type-token ratio (with caveat!)
  3. Vocabulary level: Do the essays use more of the common, everyday words, or do they use more sophisticated and technical words?
    → a. Can be measured through average word length (common words tend to be shorter), and
    → b. Can be measured against published lists of top most frequent English words
You will notice that items 1, 2 and 3a can be measured corpus-internally. Item 3b will have to rely on an external, pre-compiled list of common English words. We have such a resource handy: Peter Norvig's list of the 1/3 million most frequent words and their counts will do nicely.


PART 1: Prepare Vocabulary Bands [15 points]


Norvig's list of the 1/3 million most frequent unigram counts consists of the top 333k entries of the famous Google Web 1T 5-gram dataset. In class, we processed the count_1w.txt file into a list and a FreqDist, which we pickled. In this part, we will process them further to use in our study.

First, a bit of background. In SLA (second language acquisition) literature, vocabulary levels are commonly grouped into frequency bands. For example, Lextutor's VocabProfile identifies each word in a submitted passage as a 1k type, a 2k type, or an off-list type, the idea being that 1k-type words are among the top 1,000 most frequent English words, 2k-type words belong to the next 1,000 most frequent ones, and so on. Rather than talking about a particular word being ranked, say, 1,302nd in terms of frequency, we can talk about this word being in the 2k band, which holds a certain intuitive appeal.

So: let's get to it. Starting from your Google list of (word, count) tuples (we called it goog1w_rank in class), build a new dictionary, named goog_kband, where the key is a word and the value is its k-band. Specifics:

  • Ranks 1-1,000 should have the k-band value of 1, and
  • ranks 1,001-2,000 should have the k-band value of 2, etc.
  • We will however limit ourselves to 20 such bands: all words beyond the rank of 20,000 should be excluded from this dictionary.
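As a starting point, here is a minimal sketch of the band-building step. The helper name build_kband and the toy data are my own illustration, not the class code; swap in your actual goog1w_rank and the real band sizes:

```python
import pickle

def build_kband(rank_list, n_bands=20, band_size=1000):
    """Map each word in the top n_bands * band_size ranks to its k-band:
    ranks 1-1000 -> band 1, ranks 1001-2000 -> band 2, and so on."""
    return {word: i // band_size + 1
            for i, (word, _count) in enumerate(rank_list[:n_bands * band_size])}

# Toy stand-in for goog1w_rank (the real list has 333k (word, count) tuples);
# tiny bands here purely so the banding logic is visible:
toy_rank = [('the', 100), ('of', 90), ('and', 80), ('a', 70)]
toy_kband = build_kband(toy_rank, n_bands=2, band_size=2)
print(toy_kband)   # {'the': 1, 'of': 1, 'and': 2, 'a': 2}

# With the real data:
#   goog_kband = build_kband(goog1w_rank)
#   with open('goog_kband.pkl', 'wb') as f:
#       pickle.dump(goog_kband, f)
```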

When it is ready, explore it along with the two original Google data objects (goog1w_rank and goog1w_fd) in the IDLE shell, and work through the questions below.

  1. What are the ranks of teacher and student? What are their k-bands?
  2. Find a word that fits each of the 20 k-bands. Do their bands align with your own intuition?
  3. What are some examples of English words not found in the top 20k range?
  4. What is the average vocabulary band of the sentence 'I am very tired'?
  5. How about 'I am utterly exhausted' this time?
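Questions 4 and 5 boil down to a small per-token computation. Here is a sketch with made-up band values (the real values come from your goog_kband lookups):

```python
# Made-up band values for illustration only -- check your goog_kband
# for the actual bands of these words.
mini_kband = {'i': 1, 'am': 1, 'very': 1, 'tired': 2,
              'utterly': 9, 'exhausted': 7}

for sent in ['i am very tired', 'i am utterly exhausted']:
    bands = [mini_kband[w] for w in sent.split() if w in mini_kband]
    print(sent, '->', sum(bands) / len(bands))
# i am very tired -> 1.25
# i am utterly exhausted -> 4.5
```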
When you are done, pickle goog_kband so it can be used in PART 2. Then, save your IDLE shell session as hw3_vocab_band_shell.txt. Open it up in your text editor, clean up messy parts, and then add your answers to the questions above to accompany your relevant code bits.


PART 2: Bulgarian & Japanese Learner Corpora [45 points]


We are now ready to get up close with the 60 essay files by Bulgarian and Japanese students. Download the template script and the zipped archive of the corpus:

A note about this homework: you will essentially be composing two documents: (1) a word-processed file which serves as a written report on the whole investigation, and (2) a Python script. Because of this setup, you should not include your observations in your script. Instead, write them up in the word-processed document while referencing findings from the code output.

First order of business: take an old-fashioned look at some of the student essays, using your favorite text file reader. What are your first impressions? Write them down in your report document.

Next, it's Python time. The script is structured as follows:

  1. Preparation: import libraries and unpickle data files.
  2. Load the two corpora using NLTK's PlaintextCorpusReader. Print out some basic specs.
  3. Build the usual data objects, based on all lower-case tokens.
  4. Compute measurements for writing quality (more below), print out results.
  5. Print out unigram and bigram frequencies (more below).
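As a sketch of step 2, here is PlaintextCorpusReader in action on a throwaway one-file corpus (so the example runs anywhere); for the homework, point the reader at the unzipped essay directories instead. The paths shown in the comment are placeholders, not the actual archive layout:

```python
import os, tempfile
from nltk.corpus import PlaintextCorpusReader

# For the real corpora, something along these lines (paths are assumptions):
#   bulg = PlaintextCorpusReader('path/to/bulgarian', r'.*\.txt')
#   japn = PlaintextCorpusReader('path/to/japanese', r'.*\.txt')

root = tempfile.mkdtemp()
with open(os.path.join(root, 'essay1.txt'), 'w') as f:
    f.write('This is a short essay.')

corpus = PlaintextCorpusReader(root, r'.*\.txt')
print(corpus.fileids())   # ['essay1.txt']
print(corpus.words())     # ['This', 'is', 'a', 'short', 'essay', '.']
# corpus.sents() and corpus.raw() give sentence- and character-level views
# (corpus.sents() needs NLTK's punkt sentence tokenizer data)
```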
Here are more details on [D] the various measurements intended to aid us in assessing the writing quality of the two learner groups.
  1. Average essay length. What is the average length of the Bulgarian essays? How about the Japanese ones?
  2. Average sentence length. What is the average sentence length of the Bulgarian writings? How about Japanese?
  3. Lexical diversity. Which group uses more diverse vocabulary? Find the type-token ratio of the two corpora.
  4. Average word length. Which group uses longer words -- Bulgarian or Japanese? Find the answer through the average word token length measured in # of characters. Exclude tokens that are symbols or punctuation in your calculation. Note that these should be calculated on tokens, not types.
  5. Average vocabulary band. Compute, for each group, the average vocabulary band of the words used.
    • Again, we need to calculate these on a per-token basis, not per-type. 'I am a platypus, I really really am' has a lower average vocabulary band than 'I am really a platypus', even though the two share exactly the same word types.
    • But crucially, exclude from the calculation any words that are not found in the 20 bands. That is, if a text consists of 6 tokens which have bands [2, 8, 13, 8, not-in, 17], then the average k-band should be calculated as (2+8+13+8+17) divided by 5, and not 6. Think about it: if we divided by 6, the "not-in" word is in effect given the vocabulary band of 0, which is not right. Essentially, then, we're treating these out-of-band words as if they are not even there.
    • You may ask: why exclude 21+ band words at all? The answer is, 21+ band words in learner-produced texts are more likely to be misspellings, personal names and other oddities than advanced vocabulary.
  6. % of 11+ band word types. Of all word types found in each corpus, what % comes from bands 11-20? What are some example words?
    • Again, we do not count words from 21+ bands, but the base of the division this time should still be *all* types. So, if a corpus had 150 word types where 75 are from bands 1-10, 45 from bands 11-20, and 30 are out-of-band words, then the % is calculated as 45/150 * 100.
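The measurements in items 4 through 6 can be sketched as small helper functions. This is just one possible shape -- the function names and the .isalpha() token filter are my choices, not the template's:

```python
def avg_word_length(tokens):
    """Item 4: mean character length of word tokens, skipping punctuation
    and symbols. (.isalpha() is one reasonable filter, though it also drops
    forms like "don't" -- decide on and document your own criterion.)"""
    words = [t for t in tokens if t.isalpha()]
    return sum(len(w) for w in words) / len(words)

def avg_kband(tokens, kband):
    """Item 5: per-token average band. Out-of-band tokens are skipped
    entirely: they count toward neither the sum nor the denominator."""
    bands = [kband[t] for t in tokens if t in kband]
    return sum(bands) / len(bands)

def pct_high_band_types(types, kband, low=11, high=20):
    """Item 6: % of ALL types (out-of-band ones stay in the denominator)
    whose band falls between low and high."""
    hits = [t for t in types if low <= kband.get(t, 0) <= high]
    return len(hits) / len(types) * 100

# The worked example from item 5: bands [2, 8, 13, 8, not-in, 17]
toy_band = {'a': 2, 'b': 8, 'c': 13, 'd': 17}
print(avg_kband(['a', 'b', 'c', 'b', 'zzz', 'd'], toy_band))   # 9.6
```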
Now more on [E] unigram and bigram frequencies. Address the following:
  • Compare the top frequencies across the two learner groups. Are there any noticeable differences in the overall rankings and/or make-up? What could they suggest in terms of their writing levels?
  • Beyond comparing the top most frequent n-grams, how else could you use n-gram statistics for the purpose of assessing EFL/ESL writing quality? Could large-scale, native-corpus-sourced n-gram frequency lists such as the Norvig/Google bigram lists and the COCA n-gram lists be useful, and in what way?

In your written report, citing these findings (cite, don't just paste in screenshots!), compose a comparison summary of the English writing quality of the Bulgarian and Japanese college students. Include your assessment of how well these metrics capture the two groups' writing levels.

SUBMIT:
  • PART 1: hw3_vocab_band_shell.txt and goog_kband.pkl
  • PART 2: (1) Your word-processed document containing a written report, (2) ICLE_efl_writing.py, (3) ICLE_efl_writing.OUT.txt (script output saved as a .txt file).