Na-Rae Han (naraehan@pitt.edu), 2/17/2017, Pitt Library Workshop

Preparation

Jupyter tips:

  • Shift+ENTER to run cell, go to next cell
  • Alt+ENTER to run cell, create a new cell below

More on https://www.cheatography.com/weidadeyue/cheat-sheets/jupyter-notebook/

The very basics

First code

  • Printing a string, using print().
In [ ]:
print("hello, world!")

The string type

  • String type objects are enclosed in quotation marks.
  • + is a concatenation operator.
  • Below, greet is a variable name assigned to a string value; note that the variable name itself is written without quotation marks.
In [ ]:
greet = "Hello, world!"
greet + " I come in peace."
  • String methods such as .upper(), .lower() transform a string.
In [ ]:
greet.upper()
  • len() returns the length of a string, measured in number of characters.
In [ ]:
len(greet)
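A small sketch of the points above, assuming the same greet string. String methods return a new string and leave the original untouched, and len() counts every character, including spaces and punctuation. The names shouted and whispered are made up for this illustration.

```python
greet = "Hello, world!"

# String methods return a new string; greet itself is not modified.
shouted = greet.upper()
whispered = greet.lower()

print(shouted)      # HELLO, WORLD!
print(whispered)    # hello, world!
print(greet)        # Hello, world!

# len() counts letters, the comma, the space and the exclamation mark.
print(len(greet))   # 13
```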

Numbers

  • Integers and floats are written without quotes.
  • You can use algebraic operations such as +, -, * and / with numbers.
In [ ]:
num1 = 5678
num2 = 3.141592
result = num1 / num2
print(result)
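A short sketch of how integers and floats interact, assuming Python 3 (which the print() calls in this notebook suggest). In Python 3, / always produces a float, even when both operands are integers.

```python
num1 = 5678        # an integer
num2 = 3.141592    # a float

print(num1 + 2)            # 5680 (int + int stays an int)
print(num1 / 2)            # 2839.0 (/ always gives a float in Python 3)

result = num1 / num2
print(type(result))        # <class 'float'>
print(round(result, 2))    # result rounded to two decimal places
```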

Lists

  • Lists are enclosed in [ ], with elements separated with commas. Lists can have strings, numbers, and more.
  • As with strings, you can use len() to get the size of a list.
  • As with strings, you can use in to test whether an element is in a list.
In [ ]:
li = ['red', 'blue', 'green', 'black', 'white', 'pink']
len(li)
In [ ]:
'blue' in li
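One difference worth seeing side by side: on a list, in tests for a whole element, while on a string it tests for a substring. A minimal sketch using the same list:

```python
li = ['red', 'blue', 'green', 'black', 'white', 'pink']

print('blue' in li)       # True: 'blue' is an element of the list
print('blu' in li)        # False: no element is exactly 'blu'
print('blu' in 'blue')    # True: on a string, in looks for a substring

print(len(li))            # 6 (number of elements)
print(len('blue'))        # 4 (number of characters)
```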

for loop

  • Using a for loop, you can step through a list of items, applying the same set of operations to each element.
  • As with conditionals (if statements), the embedded code block is marked with indentation.
In [ ]:
for x in li :
    print(x, len(x))
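A loop can also build up a result across iterations. The sketch below adds up the lengths of all the color names; the variable name total is chosen here for illustration.

```python
li = ['red', 'blue', 'green', 'black', 'white', 'pink']

total = 0
for x in li:
    # This indented line runs once for each element of li.
    total = total + len(x)

print(total)              # 26 (total number of characters)
print(total / len(li))    # average length of a color name
```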

List comprehension

  • List comprehension builds a new list from an existing list.
  • You can filter in only certain elements, and you can apply transformation in the process.
  • Try: .upper(), len(), +'ish'
In [ ]:
[x for x in li if x.endswith('e')]
In [ ]:
[x+'ish' for x in li]
In [ ]:
[len(x) for x in li]
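Filtering and transformation can be combined in a single comprehension. As a sketch, the example below keeps only the five-letter colors and upper-cases them, then does the same thing with an ordinary for loop for comparison. The names fives and fives_loop are made up for this illustration.

```python
li = ['red', 'blue', 'green', 'black', 'white', 'pink']

# Filter (if len(x) == 5) and transform (x.upper()) in one expression.
fives = [x.upper() for x in li if len(x) == 5]
print(fives)    # ['GREEN', 'BLACK', 'WHITE']

# The same result, written as a for loop.
fives_loop = []
for x in li:
    if len(x) == 5:
        fives_loop.append(x.upper())
print(fives_loop == fives)    # True
```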

Dictionaries

  • Dictionaries hold key:value mappings.
  • len() on a dictionary returns the number of keys.
In [ ]:
di = {'Homer':35, 'Marge':35, 'Bart':10, 'Lisa':8}
di['Bart']
In [ ]:
len(di)
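A few more things you can do with a dictionary, sketched below: add a new key, test for a key with in, and loop over the keys. The 'Maggie' entry is a made-up addition for illustration only.

```python
di = {'Homer':35, 'Marge':35, 'Bart':10, 'Lisa':8}

# Assigning to a new key adds an entry (hypothetical example value).
di['Maggie'] = 1

print('Maggie' in di)    # True: in tests the keys
print(35 in di)          # False: values are not searched
print(len(di))           # 5

# Looping over a dictionary steps through its keys.
for name in di:
    print(name, di[name])
```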

Using NLTK

  • NLTK is an external module; you can start using it after importing it.
  • nltk.word_tokenize() is a handy tokenizing function, one of the many functions NLTK provides.
  • It turns a text (a single string) into a list of word tokens.
In [ ]:
import nltk
In [ ]:
nltk.word_tokenize(greet)
In [ ]:
sent = "You haven't seen Star Wars...?"
nltk.word_tokenize(sent)
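To see why a dedicated tokenizer is useful, compare it with plain whitespace splitting. The sketch below uses only the built-in str.split(), not NLTK: punctuation stays glued to the neighboring word, so "Wars...?" comes out as a single token.

```python
sent = "You haven't seen Star Wars...?"

# Naive approach: split on whitespace only.
naive = sent.split()

print(naive)         # ['You', "haven't", 'seen', 'Star', 'Wars...?']
print(len(naive))    # 5
```

Running nltk.word_tokenize(sent) on the same sentence and comparing the two lists shows what the tokenizer does differently.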
  • nltk.FreqDist() is another useful NLTK function.
  • It builds a frequency dictionary from a list: each unique item becomes a key, and its value is the number of times it occurs.
In [ ]:
sent = 'Rose is a rose is a rose is a rose.'
toks = nltk.word_tokenize(sent)
print(toks)
In [ ]:
freq = nltk.FreqDist(toks)
freq
In [ ]:
freq.most_common(3)
In [ ]:
freq['rose']
In [ ]:
len(freq)
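Note that counts are case-sensitive: 'Rose' and 'rose' are different keys. The sketch below shows one common remedy, lower-casing the tokens before counting. To keep it runnable without NLTK, it uses collections.Counter from the standard library as a stand-in (nltk.FreqDist is a subclass of Counter), and the token list is typed in by hand rather than produced by the tokenizer.

```python
from collections import Counter

# Hand-typed tokens for 'Rose is a rose is a rose is a rose.'
toks = ['Rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']

freq = Counter(toks)
print(freq['rose'])    # 3 ('Rose' is counted separately)
print(freq['Rose'])    # 1
print(len(freq))       # 5 unique tokens

# Lower-case every token first, then count.
freq_lower = Counter([t.lower() for t in toks])
print(freq_lower['rose'])            # 4
print(len(freq_lower))               # 4 unique tokens
print(freq_lower.most_common(1))     # [('rose', 4)]
```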

Reading in a text file

  • open(filename).read() reads in the content of a text file as a single string.
In [ ]:
myfile = 'C:/Users/narae/Desktop/inaugural/1789-Washington.txt'  # Mac users should leave out C:
wtxt = open(myfile).read()
print(wtxt)
In [ ]:
len(wtxt)     # Number of characters in text
In [ ]:
'fellow citizens' in wtxt
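Two things worth knowing when reading files, sketched below under some assumptions: the with statement closes the file automatically, and in is case-sensitive, so lower-casing the text first makes a search more forgiving. So that the example runs anywhere, it first writes a small throwaway file; the file name sample.txt and its contents are made up for illustration.

```python
import os
import tempfile

# Create a small throwaway text file (hypothetical content).
sample = "Fellow citizens, hello.\nThis is the second line."
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# Read it back; the file is closed automatically after the with block.
with open(path, encoding="utf-8") as f:
    txt = f.read()

print('fellow citizens' in txt)            # False: in is case-sensitive
print('fellow citizens' in txt.lower())    # True
```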

Tokenize text, compile frequency count

In [ ]:
nltk.word_tokenize(wtxt)
In [ ]:
wtokens = nltk.word_tokenize(wtxt)
len(wtokens)     # Number of tokens in text (words and punctuation marks)
In [ ]:
wfreq = nltk.FreqDist(wtokens)
wfreq['citizens']
In [ ]:
len(wfreq)      # Number of unique tokens (word types) in text
In [ ]:
wfreq.most_common(40)     # 40 most common words

Average sentence length, frequency of long words

In [ ]:
sentcount = wfreq['.'] + wfreq['?'] + wfreq['!']  # Assuming every sentence ends with ., ! or ?
sentcount
In [ ]:
len(wtokens)/sentcount     # Average sentence length in number of words
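Because the token list includes punctuation marks, the average above is slightly inflated. The sketch below works through the arithmetic on a tiny hand-made token list (not the inaugural text) and also computes the average with punctuation tokens left out. It uses collections.Counter as a stand-in for nltk.FreqDist, and only the three sentence-final marks are filtered out.

```python
from collections import Counter

# Hand-made tokens for illustration: three short sentences.
tokens = ['I', 'came', '.', 'I', 'saw', '!', 'Did', 'I', 'conquer', '?']
freq = Counter(tokens)

# Same assumption as above: every sentence ends with ., ! or ?
sentcount = freq['.'] + freq['?'] + freq['!']
print(sentcount)                   # 3

# Average counting all tokens, punctuation included.
print(len(tokens) / sentcount)     # 10 tokens / 3 sentences

# Average counting words only.
words_only = [t for t in tokens if t not in ('.', '?', '!')]
print(len(words_only) / sentcount) # 7 words / 3 sentences
```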
In [ ]:
[w for w in wfreq if len(w) >= 13]       # all 13+ character words
In [ ]:
long = [w for w in wfreq if len(w) >= 13] 
for w in long :
    print(w, len(w), wfreq[w])               # long words tend to be less frequent
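To list the long words from most to least frequent, you can sort them by their count. A minimal sketch on a made-up sentence, tokenized with str.split() and counted with collections.Counter as a stand-in for nltk.FreqDist; the names long_words and ranked are chosen here for illustration.

```python
from collections import Counter

# Made-up text for illustration; split() is enough since it has no punctuation.
text = "extraordinary circumstances require extraordinary responsibilities"
freq = Counter(text.split())

# All 13+ character words.
long_words = [w for w in freq if len(w) >= 13]

# Sort by frequency, highest count first.
ranked = sorted(long_words, key=lambda w: freq[w], reverse=True)
for w in ranked:
    print(w, len(w), freq[w])
```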

Your turn: process the 2009-Obama.txt file.

What next?

Take a Python course!