Go to: Na-Rae Han's home page  

Python 3 Notes


Regular Expressions

On this page: re module, re.findall(), re.sub(), re.compile(), re.search(), re.search().group().

The re Module

So you learned all about regular expressions and are ready to use them in Python. Let's get to it! The re module is the part of Python's standard library that handles all things regular expression. Like any other module, you start by importing it.
 
>>> import re
>>>  

Finding All Matches in a String

Suppose you want to find all words starting with 'wo' in the very short text below. What you want to use is the re.findall() function. It takes two arguments: (1) the regular expression pattern, and (2) the target string to find matches in.
 
>>> wood = 'How much wood would a woodchuck chuck if a woodchuck could chuck wood?'
>>> re.findall(r'wo\w+', wood)        # r'...' for raw string
['wood', 'would', 'woodchuck', 'woodchuck', 'wood'] 
>>> 
First of all, note that the regular expression r'wo\w+' is written as a raw string, as indicated by the r'...' string prefix. That is because regular expressions, as you are aware by now, use the backslash "\" as their own special escape character, and without 'r' the backslash gets interpreted as *Python's* special escape character. Basically, on Python's string object level, that "\" in "\w" should be interpreted as a literal backslash character so that it can later be interpreted as a regular expression's special escape character when the string is processed by the re module. If this all sounds too complicated, just remember to ALWAYS PREFIX YOUR REGULAR EXPRESSION WITH 'r'.
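To see the difference the r prefix makes, here is a small sketch. It uses \b (the regular expression word boundary) rather than \w, because '\b' happens to be an escape sequence that Python itself understands, namely the backspace character. The variable names plain, raw and text are made up for this illustration.

```python
import re

text = 'How much wood would a woodchuck chuck'

plain = '\bwood'    # no r prefix: Python turns \b into ONE backspace character
raw = r'\bwood'     # r prefix: the backslash and the 'b' stay as two characters

print(len(plain))   # 5
print(len(raw))     # 6

# The plain version looks for a literal backspace followed by 'wood': no luck.
print(re.findall(plain, text))   # []

# The raw version hands \b to the re module, which reads it as a word boundary.
print(re.findall(raw, text))     # ['wood', 'wood']
```

The second 'wood' in the raw result is the beginning of 'woodchuck': \b only requires a word boundary before the 'w', not after the 'd'.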

Back to re.findall(). It returns all matched string portions as a list. If there are no matches, it will simply return an empty list:
 
>>> re.findall(r'o+', wood)
['o', 'oo', 'o', 'oo', 'oo', 'o', 'oo'] 
>>> re.findall(r'e+', wood)
[] 
What if you want to ignore case in your matches? You can specify it as a third optional argument: re.IGNORECASE.
 
>>> foo = 'This and that and those'
>>> re.findall(r'th\w+', foo)
['that', 'those'] 
>>> re.findall(r'th\w+', foo, re.IGNORECASE)    # case is ignored while matching
['This', 'that', 'those'] 
>>> 
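For reference, the same case-insensitive search can be spelled a few different ways. The sketch below is only meant to show that they agree; the variable names are made up for this illustration.

```python
import re

foo = 'This and that and those'

# re.I is the short alias for re.IGNORECASE
short_flag = re.findall(r'th\w+', foo, re.I)

# Passing the flag by keyword makes it obvious what the third argument is
keyword_flag = re.findall(r'th\w+', foo, flags=re.IGNORECASE)

# (?i) at the very start of the pattern switches on the same behavior inline
inline_flag = re.findall(r'(?i)th\w+', foo)

print(short_flag)                                   # ['This', 'that', 'those']
print(short_flag == keyword_flag == inline_flag)    # True
```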

Substituting All Matches in a String

What if you want to replace all matching portions with something else? It can be done using the re.sub() function. Below, we are finding all vowel sequences and replacing them with '-'. The function returns the result as a new string.
 
>>> wood
'How much wood would a woodchuck chuck if a woodchuck could chuck wood?' 
>>> re.sub(r'[aeiou]+', '-', wood)    # 3 args: regex, replacer string, target string
'H-w m-ch w-d w-ld - w-dch-ck ch-ck -f - w-dch-ck c-ld ch-ck w-d?' 
>>> 
Removing the matching portions can also be achieved through re.sub(): just make the "replacer" string an empty string ''.
 
>>> re.sub(r'[aeiou]+', '', wood)    # substitute with an empty string
'Hw mch wd wld  wdchck chck f  wdchck cld chck wd?' 
>>> 
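Two details about re.sub() are worth a quick check: the original string is left untouched, and the optional count argument caps how many replacements are made. A minimal sketch, with variable names made up for this illustration:

```python
import re

wood = 'How much wood would a woodchuck chuck if a woodchuck could chuck wood?'

# re.sub() builds and returns a NEW string; wood itself does not change
dashed = re.sub(r'[aeiou]+', '-', wood)
print(wood)
print(dashed)

# count=2 stops after the first two replacements, working left to right
first_two = re.sub(r'[aeiou]+', '-', wood, count=2)
print(first_two)
# 'H-w m-ch wood would a woodchuck chuck if a woodchuck could chuck wood?'
```

If you want to keep the result under the same name, reassign it yourself: wood = re.sub(...).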

Compiling a Regular Expression Object

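If you have to match a regular expression on many different strings, it is a good idea to construct the regular expression as a Python object. That way, the matching machinery for the regular expression (think of it as the finite-state automaton) is compiled once and reused. Since constructing an FSA is rather computationally expensive, this lightens the processing load. (The re module does keep a cache of recently used patterns behind the scenes, so the savings are most noticeable when you work with many different patterns or call the matcher a very large number of times.) To do this, use the re.compile() function: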
 
>>> myre = re.compile(r'\w+ou\w+')     # compiling myre as a reg ex
>>> myre.findall(wood)                 # calling .findall() directly on myre
['would', 'could'] 
>>> myre.findall('Colorless green ideas sleep furiously')
['furiously'] 
>>> myre.findall('The thirty-three thieves thought that they thrilled the throne throughout Thursday.')
['thought', 'throughout'] 
Once compiled, you call a re method directly on the regular expression object. In the example above, myre is the compiled regular expression object corresponding to r'\w+ou\w+', and you call .findall() on it as myre.findall(). In doing so, you need to specify one fewer argument: the target string is the only thing needed, as in myre.findall(wood).
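Here is a short sketch that puts the compiled and uncompiled forms side by side. It also shows where a flag such as re.IGNORECASE goes once you compile: into re.compile() itself, not into the later .findall() call. The names sentences and th_words are made up for this illustration.

```python
import re

sentences = [
    'How much wood would a woodchuck chuck if a woodchuck could chuck wood?',
    'Colorless green ideas sleep furiously',
]

myre = re.compile(r'\w+ou\w+')

for s in sentences:
    compiled_result = myre.findall(s)                # pattern is already baked in
    module_result = re.findall(r'\w+ou\w+', s)       # pattern passed every time
    print(compiled_result, compiled_result == module_result)

# Flags are given at compile time
th_words = re.compile(r'th\w+', re.IGNORECASE)
print(th_words.findall('This and that and those'))   # ['This', 'that', 'those']
```

The loop prints ['would', 'could'] True and then ['furiously'] True, so the two forms are interchangeable as far as results go.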

Testing if a Match Exists

Sometimes, we are only interested in confirming whether or not there is a match within the given string. For that, re.findall() is overkill, because it scans the entire string to produce *every* matching substring. This is fine when you are dealing with a few short strings like we are here, but in the real world your strings might be much longer and/or you will be doing the matching thousands or even millions of times, so the difference adds up.

In this context, re.search() is a good alternative. It only finds the first match and then quits. If a match is found, it returns a "match object". But if not, it returns... nothing (technically, the special Python value None, which the interactive shell does not bother to display). Below, r'e+' is successfully matched in the 'Colorless...' string, so a match object is returned. Funnily enough, there is not a single 'e' in our wood, so the same search returns nothing.
 
>>> re.search(r'e+', 'Colorless green ideas sleep furiously')
<_sre.SRE_Match object at 0x02D9CB48> 
>>> re.search(r'e+', wood)
>>> 
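If you would like to convince yourself what that "nothing" really is, the sketch below stores both results in variables (hit and miss, names made up for this illustration) and inspects them.

```python
import re

wood = 'How much wood would a woodchuck chuck if a woodchuck could chuck wood?'

hit = re.search(r'e+', 'Colorless green ideas sleep furiously')
miss = re.search(r'e+', wood)

# An unsuccessful search gives back None
print(miss is None)     # True

# A successful search gives back a match object, which knows where it matched
print(hit is None)      # False
print(hit.span())       # (6, 7): the 'e' in 'Colorless'
```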
If you want to see the actual matched portion, you can use the .group() method defined on the match object. There's a problem though: it works fine when there is a match and therefore a match object has been returned, but when there is no match, there is no returned object, so...
 
>>> re.search(r'e+', 'Colorless green ideas sleep furiously').group()
'e' 
>>> re.search(r'e+', wood).group()
Traceback (most recent call last):
  File "<pyshell#25>", line 1, in <module>
    re.search(r'e+', wood).group()
AttributeError: 'NoneType' object has no attribute 'group'
Therefore, you want to use re.search() in the context of an if statement. Below, the if ... line checks whether the .search() call returned a match object, and only then do you proceed to print out the matched portion and the matching line. (NOTE: if someobj counts as True as long as someobj is not one of the following: "nothing" (None), False, the number 0, an empty string "", an empty list [], an empty dictionary {}, or another empty container such as an empty tuple or set.)
 
>>> f = open('D:\\Lab\\ling1330\\bible-kjv.txt')
>>> blines = f.readlines()
>>> f.close()
>>> smite = re.compile(r'sm(i|o)te\w*')
>>> for b in blines:
        matchobj = smite.search(b)
        if matchobj:         # True if matchobj is not "nothing"
            print(matchobj.group(), '-', b, end='')

smite - again smite any more every thing living, as I have done.
smote - were with him, and smote the Rephaims in Ashteroth Karnaim, and the
smite - hand of Esau: for I fear him, lest he will come and smite me, and the
smote - 36:35 And Husham died, and Hadad the son of Bedad, who smote Midian in
smitest - Wherefore smitest thou thy fellow?  2:14 And he said, Who made thee a
smite - 3:20 And I will stretch out my hand, and smite Egypt with all my
smite - behold, I will smite with the rod that is in mine hand upon the waters
smote - up the rod, and smote the waters that were in the river, in the sight
smite - 8:2 And if thou refuse to let them go, behold, I will smite all thy
smite - rod, and smite the dust of the land, that it may become lice
smotest - with thee of the elders of Israel; and thy rod, wherewith thou smotest
smite - and thou shalt smite the rock, and there shall come water out of it,
smiteth - 21:12 He that smiteth a man, so that he die, shall be surely put to
smiteth - 21:15 And he that smiteth his father, or his mother, shall be surely
smite - 21:18 And if men strive together, and one smite another with a stone,
... 
In the example, we want to pull out all lines in the Bible that have a 'smite/smote...' word in them. We first load up the Bible file as a *list of lines* using the .readlines() method. Then, because we will be doing the matching many times over, we do the smart thing and compile our regular expression. Finally, for-looping through the Bible lines, we create a match object through .search(), and print out the matched portion and the line only if a match object exists. Note that .search() reports only the first matching word on each line.
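If you do not have the Bible file handy, the same search-inside-an-if pattern can be tried on a few lines kept in a list. This is only a sketch: sample_lines is a made-up stand-in for the real file, matching_lines() is a hypothetical helper, and the group is written as (?:i|o) (a non-capturing group) instead of (i|o). The last two lines show why that matters as soon as you switch from .search() to .findall().

```python
import re

# Made-up stand-in for the lines of the Bible file
sample_lines = [
    'again smite any more every thing living, as I have done.\n',
    'In the beginning God created the heaven and the earth.\n',
    'Wherefore smitest thou thy fellow?\n',
    'up the rod, and smote the waters that were in the river,\n',
]

smite = re.compile(r'sm(?:i|o)te\w*')    # (?:...) groups without capturing

def matching_lines(lines):
    """Yield (matched_word, line) for every line that contains a match."""
    for line in lines:
        matchobj = smite.search(line)
        if matchobj is not None:         # skip lines where search() gave None
            yield matchobj.group(), line.rstrip('\n')

results = list(matching_lines(sample_lines))
for word, line in results:
    print(word, '-', line)

# With a capturing group, findall() returns only the text inside the group...
print(re.findall(r'sm(i|o)te\w*', sample_lines[0]))   # ['i']
# ...while the non-capturing version returns the whole matched word
print(smite.findall(sample_lines[0]))                 # ['smite']
```

Because matching_lines() accepts anything you can loop over, an open file object works in place of the list, and it hands over one line at a time without loading the whole file first.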