
Lab 5

Objectives: more on Regular Expressions; simple perl scripts
Overview

We learn about: extended regular expressions, perl syntax, and shell scripting for looping through files.

Regular Expressions, continued

  1. Some character classes are defined by the POSIX standard; we have used a few of them before. The following are the most common and useful ones. Please see this page for a detailed explanation.

    POSIX Character Classes:
    class       explanation
    [:digit:]   matches any single digit (0-9)
    [:alnum:]   matches any alphanumeric character (0-9, A-Z, a-z)
    [:alpha:]   matches any alphabetic character (A-Z, a-z)
    [:upper:]   matches any uppercase alphabetic character (A-Z)
    [:lower:]   matches any lowercase alphabetic character (a-z)
    [:blank:]   matches SPACE and TAB
    [:punct:]   matches punctuation symbols: . , " ' ? ! ; : # $ % & ( ) * + - / < > = @ [ ] \ ^ _ { } | ~

    Note that these are always used inside an additional set of square brackets: [[:digit:]] or [[:digit:][:alpha:]] matches a single character. By contrast, [[:digit:]][[:alpha:]], which is two bracket expressions in a row, matches a string of two characters (a digit followed by a letter). Because grep prints the entire line whenever it finds a match, both commands below print the same line:

    $ echo 'this is 20th century...!' | grep -E '[[:digit:][:alpha:]]'
    this is 20th century...!

    $ echo 'this is 20th century...!' | grep -E '[[:digit:]][[:alpha:]]'
    this is 20th century...!
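
    To see what each pattern actually matches, the -o switch can be used to print only the matched text instead of the whole line. This is a minimal sketch, assuming a grep that supports -o (GNU and BSD grep both do); the variable names are made up for illustration.

```shell
line='this is 20th century...!'

# One bracket expression: every single letter or digit is a separate match.
# tr -d '\n' joins the one-per-line matches so they fit on one line.
single=$(echo "$line" | grep -oE '[[:digit:][:alpha:]]' | tr -d '\n')
echo "$single"

# Two bracket expressions in a row: a digit immediately followed by a letter.
pair=$(echo "$line" | grep -oE '[[:digit:]][[:alpha:]]')
echo "$pair"
```

    The first command matches seventeen separate characters, while the second finds only one two-character string, the 0t in 20th.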

  2. In addition to the standard Regular Expression syntax, many programming languages (PHP, Perl, Python, Java, etc.) support the following common abbreviations. Note that the lowercase version (written '\x' here) and the uppercase version ('\X') work as a pair; the latter is the complement of the former. See this page for more information.

    Character Class Abbreviations:
    character   explanation                                                          matches                                    does not match
    \d          matches any single digit (0-9)                                       '1', '23', '23rd'                          'twenty'
    \D          matches any character NOT in the range 0-9                           'Amy', 'twenty', '23rd', '0.25', '1,000'   '1', '23'
    \s          matches any whitespace character (space, tab, etc.)                  ' ', 'this hat', 'hat '                    'hat'
    \S          matches any character that is non-whitespace                         'this hat', 'hat'                          ' '
    \w          matches any alphanumeric character or underscore (0-9, A-Z, a-z, _)  'Amy', 'a', 'A', '12', '23rd'              '...', '!', ' '
    \W          matches everything else (punctuation, symbol, whitespace)            'my hat', 'yes!', '3/4'                    '23rd', 'Amy'

    You might have noticed that sed and grep do not reliably accept these abbreviations; \d, for instance, is not part of the regex syntax used by sed or grep -E. So, why bother learning these? The answer is that you can use them in perl, and later, AntConc. First, let's learn the basics of perl...

Doing Everything with perl

  1. We are first going to write an extremely simple perl script. Open up a text file named greetings.pl in pico, by typing:
    pico greetings.pl
    and type in the following line:
    print "hello world\n";
    Save and exit using Ctrl-X. Now execute the perl script:
    $ perl greetings.pl
    hello world

  2. In the case above, we executed a perl script saved in a separate file (greetings.pl). Since the code itself is extremely simple, we do not have to rely on a script file at all; using the -e switch, the code can be supplied from the command-line:
    $ perl -e 'print "hello pretty\n";'
    hello pretty
    Another useful switch is -n, which loosely translates to "do something for every line of the standard input". Therefore, the following command simply prints out each and every line of the input file.
    $ perl -ne 'print;' austen-emma.2gram | more
    emma    by
    by      jane
    jane    austen
    austen  1816
    1816    volume
    volume  i
    i       chapter
    chapter i
    i       emma
    emma    woodhouse
    

  3. In perl, regex patterns are enclosed in / /. You can use perl -ne 'print if /PATTERN/;' and perl -ne 'print unless /PATTERN/;' to simulate grep and grep -v, respectively:
    $ perl -ne 'print if /^thou\t/;' gutenberg.2gram | more
    thou    the
    thou    mayest
    thou    shalt
    thou    eatest
    thou    shalt
    thou    3
    thou    wast
    thou    eaten
    
    $ perl -ne 'print unless /[[:alpha:]]+\s[[:alpha:]]+/;' gutenberg.2gram | more
    the     8th
    the     23rd
    the     28th
    sept    28th
    the     24th
    the     7th
    of      10
    10      000
    000     l
    the     10
    

  4. Perl also provides a sed-like syntax that lets you edit your text stream on the fly. perl -ne 's/PATTERN/REPLACEMENT/g; print;' achieves just that. Therefore, this good old sed command:
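    sed 's/Emma/Juliet/g; s/Mr\. Knightley/Romeo/g' austen-emma.txt
    is the same as:
    perl -ne 's/Emma/Juliet/g; s/Mr\. Knightley/Romeo/g; print;' austen-emma.txt
    (The period is escaped as \. so that it matches a literal period rather than any single character.)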

  5. And, of course, perl provides a tr syntax as well. The following two commands are therefore equivalent:
    tr '[A-Z]' '[a-z]' < austen-emma.txt
    perl -ne 'tr/[A-Z]/[a-z]/; print' austen-emma.txt

  6. This means that we can tokenize a file using perl alone! Remember, this is how we did tokenization before:
    tr '[A-Z]' '[a-z]' < austen-emma.txt | sed -r 's/[[:punct:]]/ /g' | sed -r 's/ /\n/g' | grep -v -E '^$' | more
    which can now be done using only perl:
    cat austen-emma.txt | perl -ne 'tr/[A-Z]/[a-z]/; s/[[:punct:]]/ /g; s/ /\n/g; print;' | perl -ne 'print unless /^$/;' | more
    Note that the last perl command cannot be merged into the first one; it has to run as a separate command after the pipe. What could be the reason behind this?

Repeating Commands on Multiple Files

  1. For your homework, you bravely and diligently applied the same command to every text file. With some basic bash shell scripting, all these repetitions can be neatly packed into one round of command execution. CAUTION: this looping syntax is extremely powerful -- it would be a good idea to back up your data, and/or do trial runs.

  2. Starting from the "austen-emma.words" tokenized word file, this is how you would obtain a word frequency file:
    cat austen-emma.words | sort | uniq -c | sort -nr > austen-emma.words.freq

    Note that the output file name has the input file name built into it. We can designate this portion as a variable and formulate a loop syntax so it applies to every file in the directory that ends with .words:

    $ for myfile in *.words
    > do
    > cat "$myfile" | sort | uniq -c | sort -nr > "$myfile.freq"
    > echo "$myfile finished."
    > done
    
    The words for, in, do, and done are part of the bash shell scripting syntax. myfile is the variable; it is first used without the $ prefix and then with one throughout. It is good practice to put double quotes around $myfile so that file names containing spaces are handled as a single name. The echo command is not essential, but it provides handy feedback. When you press RETURN at the end of each line, your shell recognizes that your command is not complete and prompts with >, until done is typed in.
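
    Following the caution above, here is one way to do a trial run: the sketch below builds two tiny .words files in a scratch directory and runs the same loop there, so no real data is touched. The file names and contents are invented for the test, and mktemp -d is assumed to be available (it is on most Linux and macOS systems).

```shell
# Work in a scratch directory so that real files cannot be overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two small made-up token files, one word per line.
printf 'the\nemma\nthe\nthe\nemma\nsaid\n' > a.words
printf 'thou\nshalt\nthou\n' > b.words

# The loop from the lab, with the variable quoted.
for myfile in *.words
do
    sort "$myfile" | uniq -c | sort -nr > "$myfile.freq"
    echo "$myfile finished."
done

# Inspect one of the results: counts first, most frequent word on top.
cat a.words.freq
```

    Once the output looks right, the same loop can be run in the directory that holds the real .words files.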