Na-Rae Han (naraehan@pitt.edu), 5/30/2017, CMU DH Summer Workshop

Preparation

Jupyter tips:

  • Shift+ENTER to run cell, go to next cell
  • Alt+ENTER to run cell, create a new cell below

More on https://www.cheatography.com/weidadeyue/cheat-sheets/jupyter-notebook/

The very basics

First code

  • Printing a string, using print().
In [ ]:
print("hello, world!")

The string type

  • String type objects are enclosed in quotation marks.
  • + is a concatenation operator.
  • Below, greet is a variable name assigned to a string value; note the absence of quotation marks.
In [ ]:
greet = "Hello, world!"
greet + " I come in peace."
  • String methods such as .upper(), .lower() transform a string.
In [ ]:
greet.upper()
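  • The .lower() method mentioned above works the same way (a quick check, reusing the greet variable):
In [ ]:
greet.lower()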
  • len() returns the length of a string in characters.
In [ ]:
len(greet)

Numbers

  • Integers and floats are written without quotes.
  • You can use algebraic operations such as +, -, * and / with numbers.
In [ ]:
num1 = 5678
num2 = 3.141592
result = num1 / num2
print(num1, "divided by", str(num2), "is", result)

Lists

  • Lists are enclosed in [ ], with elements separated with commas. Lists can have strings, numbers, and more.
  • As with strings, you can use len() to get the size of a list.
  • As with strings, you can use in to test whether an element is in a list.
  • A list can be indexed with li[i]; Python indexing starts at 0.
  • A list can be sliced: li[3:5] returns a sub-list beginning with index 3 up to and not including index 5.
In [ ]:
li = ['red', 'blue', 'green', 'black', 'white', 'pink']
len(li)
In [ ]:
'mauve' not in li
In [ ]:
# Try [0], [2], [-1], [3:5], [3:], [:5]
li[0]
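  • A couple more of the suggested indexings, shown as separate cells (same li list):
In [ ]:
li[-1]      # last element
In [ ]:
li[3:5]     # sub-list: indexes 3 and 4, not including 5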

for loop

  • Using a for loop, you can iterate through a list of items, applying the same set of operations to each element.
  • The embedded code block is marked with indentation.
In [ ]:
for x in li :
    print(x, len(x))
print("Done!")

List comprehension

  • List comprehension builds a new list from an existing list.
  • You can filter for certain elements and apply a transformation in the process.
  • Try: .upper(), len(), +'ish'
In [ ]:
[x for x in li if x.endswith('e')]
In [ ]:
[x+'ish' for x in li]
In [ ]:
[len(x) for x in li]
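  • The .upper() transformation suggested above, in its own cell:
In [ ]:
[x.upper() for x in li]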

Dictionaries

  • Dictionaries hold key:value mappings.
  • len() on a dictionary returns the number of keys.
In [ ]:
di = {'Homer':35, 'Marge':35, 'Bart':10, 'Lisa':8}
di['Homer']
In [ ]:
len(di)
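  • A small extension (a sketch, reusing the di above): in tests for a key, and assignment adds or updates an entry. The name 'Maggie' below is just an illustrative new key.
In [ ]:
'Bart' in di        # membership tests keys, not values
In [ ]:
di['Maggie'] = 1    # adds a new key:value pair (illustrative entry)
len(di)             # now one key larger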

Using NLTK

  • NLTK is an external module; you can start using it after importing it.

  • nltk.word_tokenize() is a handy tokenizing function, one of the many that NLTK provides.

  • It turns a text (a single string) into a list of tokenized words.

In [ ]:
import nltk
In [ ]:
nltk.word_tokenize(greet)
In [ ]:
help(nltk.word_tokenize)
In [ ]:
sent = "You haven't seen Star Wars...?"
nltk.word_tokenize(sent)
  • nltk.FreqDist() is another useful NLTK function.
  • It builds a frequency dictionary from a list.
In [ ]:
# First "Rose" is capitalized. How to lowercase? 
sent = 'Rose is a rose is a rose is a rose.'
toks = nltk.word_tokenize(sent)
print(toks)
In [ ]:
freq = nltk.FreqDist(toks)
freq
In [ ]:
freq.most_common(3)
In [ ]:
freq['rose']
In [ ]:
len(freq)
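  • To answer the question in the comment above ("How to lowercase?"): one option, sketched here with the same sent, is to lowercase the whole string before tokenizing, so "Rose" and "rose" are counted together.
In [ ]:
toks_lower = nltk.word_tokenize(sent.lower())
freq_lower = nltk.FreqDist(toks_lower)
freq_lower['rose']      # now 4, since "Rose" is folded in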

Processing a single text file

Reading in a text file

  • open(filename).read() reads in the content of a text file as a single string.
In [ ]:
myfile = 'C:/Users/zoso/Desktop/inaugural/1789-Washington.txt'  # Mac users should leave out C:
wtxt = open(myfile).read()
print(wtxt)
In [ ]:
len(wtxt)     # Number of characters in text
In [ ]:
'fellow citizens'.lower() in wtxt.lower()  # phrase as a substring
In [ ]:
'Americans' in wtxt

Tokenize text, compile frequency count

In [ ]:
# Turn off/on pretty printing (prints too many lines)
%pprint    
In [ ]:
# Tokenize text
nltk.word_tokenize(wtxt)
In [ ]:
wtokens = nltk.word_tokenize(wtxt)
len(wtokens)     # Number of words in text
In [ ]:
# Build a dictionary of frequency count
wfreq = nltk.FreqDist(wtokens)
wfreq['citizens']
In [ ]:
wfreq['the']
In [ ]:
len(wfreq)      # Number of unique words in text
In [ ]:
wfreq.most_common(40)     # 40 most common words

Average sentence length, frequency of long words

In [ ]:
sentcount = wfreq['.'] + wfreq['?'] + wfreq['!']  # Assuming every sentence ends with ., ! or ?
print(sentcount)
In [ ]:
# Tokens include symbols and punctuation. First 50 tokens:
wtokens[:50]
In [ ]:
wtokens_nosym = [t for t in wtokens if t.isalnum()]    # alpha-numeric tokens only
len(wtokens_nosym)
In [ ]:
# First 50 tokens, alpha-numeric tokens only: 
wtokens_nosym[:50]
In [ ]:
len(wtokens_nosym)/sentcount     # Average sentence length in number of words
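  • An alternative to counting end punctuation (a sketch): nltk.sent_tokenize() splits the raw text into sentences directly, which gives another way to estimate average sentence length.
In [ ]:
sents = nltk.sent_tokenize(wtxt)
print(len(sents))                  # sentence count from NLTK's sentence tokenizer
len(wtokens_nosym)/len(sents)      # average sentence length, same idea as above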
In [ ]:
[w for w in wfreq if len(w) >= 13]       # all 13+ character words
In [ ]:
long = [w for w in wfreq if len(w) >= 13] 
for w in long :
    print(w, len(w), wfreq[w])               # long words tend to be less frequent

Processing a corpus

  • NLTK can read in an entire corpus from a directory (the 'root' directory).
  • As it reads in a corpus, it applies word tokenization and sentence tokenization (both shown below).
In [ ]:
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'C:/Users/zoso/Desktop/inaugural'  # Mac users should leave out C:
inaug = PlaintextCorpusReader(corpus_root, '.*txt')  # all files ending in 'txt' 
In [ ]:
# .txt file names as file IDs
inaug.fileids()
In [ ]:
# NLTK automatically tokenizes the corpus. First 50 words: 
print(inaug.words()[:50])
In [ ]:
# You can also specify an individual file ID. First 50 words from Obama 2009:
print(inaug.words('2009-Obama.txt')[:50])
In [ ]:
# NLTK automatically segments sentences too, which are accessed through .sents()
print(inaug.sents('2009-Obama.txt')[0])   # first sentence
print(inaug.sents('2009-Obama.txt')[1])   # 2nd sentence
In [ ]:
# How long are these speeches in terms of word and sentence count?
print('Washington 1789:', len(inaug.words('1789-Washington.txt')), len(inaug.sents('1789-Washington.txt')))
print('Obama 2009:', len(inaug.words('2009-Obama.txt')), len(inaug.sents('2009-Obama.txt')))
In [ ]:
# for-loop through file IDs and print out word count. 
for f in inaug.fileids():
    print(len(inaug.words(f)), f)
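  • A possible extension of the loop above (a sketch): report sentence count and average sentence length per speech as well.
In [ ]:
for f in inaug.fileids():
    nwords = len(inaug.words(f))
    nsents = len(inaug.sents(f))
    print(f, nwords, nsents, nwords/nsents)    # words, sentences, average sentence length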

Troubleshooting

  • Unfortunately, the 2005 Bush file produces a Unicode encoding error (an alternative encoding workaround is sketched after this list).
  • Let's make a new text file from http://www.presidency.ucsb.edu/inaugurals.php
  • Copy the text and paste it into Notepad (Windows). Make sure to choose UTF-8 encoding, not ANSI, when saving.
  • The text files are locked; we will need to save, halt, and then re-start the Python notebook.
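  • If the underlying problem is just the file's encoding, an alternative worth trying (a sketch; the right codec depends on how the file was saved, and the file name 2005-Bush.txt is assumed) is to pass an explicit encoding to PlaintextCorpusReader:
In [ ]:
# Hypothetical workaround: re-create the reader with an explicit encoding such as 'latin-1'
inaug2 = PlaintextCorpusReader(corpus_root, '.*txt', encoding='latin-1')
print(inaug2.words('2005-Bush.txt')[:20])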
In [ ]:
# Corpus size in number of words
print(len(inaug.words()))
In [ ]:
# Building word frequency distribution for the entire corpus
inaug_freq = nltk.FreqDist(inaug.words())
inaug_freq.most_common(100)
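  • The top of that list is dominated by punctuation and very common words. One way to clean up the counts (a sketch, reusing the .isalnum() trick from earlier) is to drop punctuation and lowercase the tokens before counting:
In [ ]:
inaug_toks = [w.lower() for w in inaug.words() if w.isalnum()]
inaug_freq2 = nltk.FreqDist(inaug_toks)
inaug_freq2.most_common(20)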

What next?

Take a Python course!