Regular Expressions
WARNING: This is a woefully incomplete overview of regular expressions. It would be absurd to try to fully cover the topic in a short handout like this. Hopefully, this will provide some of the basics to get you started, but to really understand regular expressions, I implore you to read as much of Mastering Regular Expressions by Jeffrey E.F. Friedl as you have time for.
A regular expression is a sequence of characters that describes or matches a given amount of text. For example, the sequence
A truly wonderful book written on the subject is: Mastering Regular Expressions by Jeffrey Friedl. Chapter 1, available via the Safari Network (through NYU) can be found here:
http://safari.oreilly.com/0596002890/mastregex2-CHP-1
Regular expressions (sometimes referred to as ‘regex’ for short) have both literal characters and meta characters. In
In this case, the ‘^’ is a meta character, i.e. it does not match the literal character ‘^’, but instead indicates the “beginning of a line.” In other words, the regex above would find a match in:
bob goes to the park.
but would not find a match in:
jill and bob go to the park.
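The two sentences above can be checked with a short Java sketch. The regex under discussion is not reproduced in this handout, so the pattern `^bob` and the class name `AnchorDemo` are assumptions made purely for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo {
    // Assumed pattern: '^' anchors the literal "bob" to the start of the input.
    static final Pattern STARTS_WITH_BOB = Pattern.compile("^bob");

    // Returns true if "bob" is found at the very beginning of the line.
    static boolean startsWithBob(String line) {
        Matcher m = STARTS_WITH_BOB.matcher(line);
        return m.find();
    }

    public static void main(String[] args) {
        System.out.println(startsWithBob("bob goes to the park."));        // true
        System.out.println(startsWithBob("jill and bob go to the park.")); // false
    }
}
```

The second sentence does contain "bob", but not at the beginning, so the anchored pattern finds nothing.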
Here are a few common meta-characters (I’m listing them below as they would appear in a Java regular expression, which may differ slightly from perl, php, .net, etc.) used to get us started:
Position Metacharacters:
^ beginning of line
$ end of line
\\b word boundary
\\B a non word boundary
Single Character Metacharacters:
. any one character
\\d any digit from 0 to 9
\\w any word character (a-z, A-Z, 0-9, and the underscore)
\\W any non-word character
\\s any whitespace character (space, tab, new line, form feed, carriage return)
\\S any non whitespace character
Quantifiers (refer to the character that precedes it):
? appearing once or not at all
* appearing zero or more times
+ appearing one or more times
{min,max} appearing within the specified range
Using the above, we could come up with some quick examples:
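Here is a minimal Java sketch of the four quantifiers. The patterns (`colou?r`, `go*gle`, `\d+`, `\w{2,4}`) and the helper `matchesAll` are illustrative assumptions, not the handout's original examples. `Pattern.matches` requires the whole input to match.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // True only if the entire input matches the regex.
    static boolean matchesAll(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // ? : the 'u' may appear once or not at all
        System.out.println(matchesAll("colou?r", "color"));   // true
        System.out.println(matchesAll("colou?r", "colour"));  // true
        // * : zero or more 'o' characters between "g" and "gle"
        System.out.println(matchesAll("go*gle", "ggle"));     // true
        // + : one or more digits, so the empty string fails
        System.out.println(matchesAll("\\d+", ""));           // false
        // {min,max} : between 2 and 4 word characters, so 5 is too many
        System.out.println(matchesAll("\\w{2,4}", "abcde"));  // false
    }
}
```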
Character Classes allow one to do an “or” statement amongst individual characters and are denoted by characters enclosed in brackets, i.e.
Another key metacharacter is |, meaning or. This is known as the concept of Alternation.
note: this regex could also be written as
Parentheses can also be used to constrain the alternation, i.e.:
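A rough Java sketch of character classes, alternation, and parentheses follows. All of the patterns and sample strings are assumptions chosen for this illustration (they are not the handout's own examples), and `found` is a hypothetical helper.

```java
import java.util.regex.Pattern;

class AlternationDemo {
    // True if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Character class: 'e' or 'a' in the third position
        System.out.println(found("gr[ea]y", "a grey cat"));         // true
        // Alternation between two whole alternatives
        System.out.println(found("grey|gray", "a gray cat"));       // true
        // Parentheses limit the alternation to a single letter
        System.out.println(found("gr(e|a)y", "a groy cat"));        // false
        // Without parentheses, '|' splits the whole pattern: "^From" OR "To:"
        System.out.println(found("^From|To:", "Reply To: bob"));    // true
        // With parentheses, '^' and ':' apply to both alternatives
        System.out.println(found("^(From|To):", "Reply To: bob"));  // false
    }
}
```

The last two lines show why the parentheses matter: the alternation operator has very low precedence, so without grouping it divides the entire expression.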
Parentheses also serve the purpose of capturing groups for back-references. For example, examine the following regular expression:
The first part of the expression without parentheses would read:
This is really really super super duper duper fun. Fun!
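To see the back-reference at work on the sentence above, here is a small Java sketch. It assumes the expression under discussion is `\b([0-9A-Za-z]+)\s+\1\b` (the form quoted in a reader comment at the end of this page), written with doubled backslashes for a Java String. The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoubleWordDemo {
    // \\1 refers back to whatever the first parenthesized group matched.
    static final Pattern DOUBLED = Pattern.compile("\\b([0-9A-Za-z]+)\\s+\\1\\b");

    // Collects each word that appears twice in a row.
    static List<String> doubledWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = DOUBLED.matcher(text);
        while (m.find()) {
            words.add(m.group(1)); // the word that was repeated
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "This is really really super super duper duper fun. Fun!";
        System.out.println(doubledWords(text)); // [really, super, duper]
    }
}
```

Note that "fun. Fun!" is not reported: the period between the two words is not whitespace, so `\s+` cannot match there.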
egrep
grep is a unix command line utility that takes an input file and a regular expression and outputs the lines that contain matches for that regular expression. It’s a quick way for us to test some regexes (and we can use it on ITP’s server or on any Mac OS X machine.) As a point of history, the name comes from the form “g/re/p” which stands for “Global Regular Expression Print.” We’ll be using egrep, which allows for more sophisticated regular expression searches. (Note: the examples below use a slightly different regex “flavor” than what we will see in Java. This is something we’ll have to get used to, and will likely cause a bit of confusion. Not to worry, confusion over regular expression flavors is extremely normal. No need to seek professional help.)
The syntax is simple:
egrep -flags 'regexpattern' filename
If we want to write the output to a file (>> appends to it):
egrep -flags 'regexpattern' filename >> outputfilename
% egrep -i 'four' bible.txt
% egrep -i 'five' bible.txt

The -i flag indicates that the match should be case-insensitive. You can find full documentation for the “egrep” command here (with full flags): http://www.unet.univie.ac.at/aix/cmds/aixcmds2/egrep.htm.
Let’s look at some other examples (special thanks to Friedl’s Mastering Regular Expressions).
Match URLs:
% egrep -i 'http://[^ ]*' a2z.txt
(run this with the following sample file: a2z.txt)
Match double words:
% egrep -i '\<(\w+) +\1\>' doubletext.txt
(run this with the following sample file: doubletext.txt)
(Note, in the above example, the metacharacter
Regular Expressions in Java
With Java 1.4, Sun introduced the java.util.regex package. Having regex support come standard with Java is a great thing, and there are many advantages to working with regexes in a robust object-oriented environment. Nevertheless, unlike with Perl (where regexes are a low-level component of the language), using regexes in Java can prove to be a bit awkward. The following will offer a brief overview of using regexes in Java. For more information, I would suggest reading Chapter 8 of Mastering Regular Expressions, the book Java Regular Expressions, and the online Sun tutorial.
Making a String into a Regular Expression
Perl accepts normal strings as regular expressions, which makes life lovely. With Java, however, a regular expression is a Pattern object that is made from a String, so we have to deal with Java’s own String metacharacters when putting together a String that will be used as a regular expression. In other words, if you use a backslash in a Java String, it is treated as a String metacharacter, i.e.:
String newline = "\n";
To actually have a backslash in a regular expression, we need to escape it with another backslash, i.e.:
String newlineregex = "\\n";
Conceptually, it might take us a moment to wrap our heads around this distinction. Functionally, the solution in Java is simple: whenever you want to have a backslash in your regex, use 2!
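The difference between the two kinds of escaping is easy to check. The following is only a sketch; the class name `EscapeDemo` and the helper `matchesLineFeed` are hypothetical.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    static final String NEWLINE = "\n";         // one character: an actual line feed
    static final String NEWLINE_REGEX = "\\n";  // two characters: a backslash and the letter n

    // True if the given regex matches a single line feed character.
    static boolean matchesLineFeed(String regex) {
        return Pattern.matches(regex, "\n");
    }

    public static void main(String[] args) {
        System.out.println(NEWLINE.length());        // 1
        System.out.println(NEWLINE_REGEX.length());  // 2

        // Both work as a regex for a line feed: the first hands the regex engine
        // a raw line feed, the second hands it the regex escape \n.
        System.out.println(matchesLineFeed(NEWLINE));        // true
        System.out.println(matchesLineFeed(NEWLINE_REGEX));  // true

        // Classes such as \d have no Java String escape, so the doubled
        // backslash is required (the String literal "\d" does not compile).
        System.out.println(Pattern.matches("\\d\\d", "42")); // true
    }
}
```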
Ok, moving on to using a regex in Java, our program must import the java.util.regex package:
import java.util.regex.*;
The classes we will use are as follows:
Our first regex program will follow this pseudo-code:
Ok, let’s take a look at the actual code:
import java.util.regex.*;

public class RegexHelloWorld {
    public static void main(String[] args) {
        String inputtext = "This is a test of regular expressions."; // Step #1
        String regex = "test";                                       // Step #2
        Pattern p = Pattern.compile(regex);                          // Step #3
        Matcher m = p.matcher(inputtext);                            // Step #4
        if (m.find()) {
            System.out.println(m.group());                           // Step #5
        } else {
            System.out.println("No match!");                         // Step #6
        }
    }
}
Note the use of the find() method, which attempts to find the next subsequence of the input sequence that matches the pattern (it returns true or false based on whether it finds something), and group(), which returns the input subsequence matched by the previous match operation. (Called with a number, as in group(1), it returns the text captured by that parenthesized group instead.)
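As a hedged illustration of how find() and group() fit together, the sketch below pulls a "key=value" pair out of a sentence. The regex, the input text, and the `describe` helper are all assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Describes the first "key=value" pair found in the input, or "No match!".
    static String describe(String input) {
        Pattern p = Pattern.compile("(\\w+)=(\\d+)");
        Matcher m = p.matcher(input);
        if (!m.find()) {
            return "No match!";
        }
        // group() is the whole match; group(1) and group(2) are the
        // parenthesized sub-matches; start()/end() give the match position.
        return m.group() + " | " + m.group(1) + " | " + m.group(2)
                + " | " + m.start() + "-" + m.end();
    }

    public static void main(String[] args) {
        System.out.println(describe("set width=640 now")); // width=640 | width | 640 | 4-13
        System.out.println(describe("nothing here"));      // No match!
    }
}
```

Calling group() before a successful find() throws an IllegalStateException, which is why the result of find() is checked first.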
If we want to look for multiple matches, we can simply use a “while” loop instead of an “if”:
String regex = "\\b(\\w+)\\b\\W+\\1"; // Regex that matches double words
Pattern p = Pattern.compile(regex); // Compile Regex
Matcher m = p.matcher(content); // Create Matcher
while (m.find()) {
    System.out.println(m.group());
}
We can also add flags when compiling the regex. For example, if we want to have a case insensitive regex:
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Two flags can be added using the bitwise OR, i.e. |
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);
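To make the effect of the two flags concrete, here is a small sketch. The regex, the sample sentence, and the `count` helper are illustrative assumptions. With Pattern.COMMENTS, whitespace inside the regex is ignored and anything from a # to the end of the line is treated as a comment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagsDemo {
    // Readable only because COMMENTS mode strips the spaces and the #-comment.
    static final String REGEX =
            "\\b the \\b   # the word 'the' on its own\n";

    // Counts the matches of REGEX in text, compiled with the given flags.
    static int count(String text, int flags) {
        Matcher m = Pattern.compile(REGEX, flags).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "The cat and the other cat.";
        System.out.println(count(text, Pattern.COMMENTS));                            // 1
        System.out.println(count(text, Pattern.CASE_INSENSITIVE | Pattern.COMMENTS)); // 2
    }
}
```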
It’s easy to see how regular expressions could improve the Flesch Index example from last week. For example, we could use a regular expression to very quickly count vowels:
String regex = "[aeiou]";
Pattern p = Pattern.compile(regex,Pattern.CASE_INSENSITIVE);
int vowelcount = 0;
Matcher m = p.matcher(content); // Create Matcher
while (m.find()) {
    vowelcount++;
}
System.out.println("Total vowels: " + vowelcount);
Splitting with Regular Expressions
It should briefly be noted that the split function we examined last week actually takes a regular expression as an argument. An input String is split into an array wherever part of that String matches the regular expression. For example. . .
String regex = "\\W"; // Use any "non-word character" as a delimiter
String[] words = content.split(regex);
System.out.println("Total words: " + words.length);
. . .is a very quick way to use regular expressions to count the number of words. (This method is not perfect by any means: for one thing, two delimiters in a row produce an empty String that gets counted as a word.)
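One way to see the imperfection is to compare the single-character delimiter with a quantified one. This is a sketch under assumed input (the sample sentence and the `splitOn` helper are made up), not a complete word counter:

```java
import java.util.Arrays;

class SplitDemo {
    // Thin wrapper around String.split so the two regexes can be compared.
    static String[] splitOn(String regex, String text) {
        return text.split(regex);
    }

    public static void main(String[] args) {
        String content = "Hello, world. It's me!";
        // "\\W" splits at every single non-word character, so ", " leaves an
        // empty String between the comma and the space.
        System.out.println(Arrays.toString(splitOn("\\W", content)));
        System.out.println(splitOn("\\W", content).length);   // 7
        // "\\W+" treats a run of non-word characters as one delimiter.
        System.out.println(Arrays.toString(splitOn("\\W+", content)));
        System.out.println(splitOn("\\W+", content).length);  // 5
    }
}
```

Even the second version still counts "It's" as two words, so a real word counter would need a more careful definition of what a word is.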
Search and Replace
Running a search and replace is one of the more powerful things one can do with regular expressions. In Java, it’s simple. The String class itself has a built-in replaceAll() method. The method takes two arguments, a regex and a replacement String. Wherever there is a regex match, it is replaced with the String provided, i.e.:
String input = "Replace every time the word \"the\" appears with the word ze.";
String regex = "\\bthe\\b"; // Match "the" only as a whole word
String output = input.replaceAll(regex, "ze");
Output yields: Replace every time ze word “ze” appears with ze word ze.
The replaceAll() method is also available in the Matcher class, i.e.:
String input = "Replace every time the word \"the\" appears with the word ze.";
String regex = "\\bthe\\b"; // Match "the" only as a whole word
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(input);
String output = m.replaceAll("ze");
We can also reference the matched text using a backreference in the substitution string. A backreference to the entire match is indicated as $0, i.e.:
String input = "Anytime a sequence of one or more vowels appears, \n" +
"we're going to double the vowels.";
String regex = "[aeiou]+"; // One or more lowercase vowels in a row
String output = input.replaceAll(regex, "$0$0");
Output yields:
Anytiimee aa seequeuencee oof oonee oor mooree vooweels aappeaears, wee’ree goioing too dououblee thee vooweels.
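The same dollar-sign notation reaches individual capturing groups: in Java's replacement syntax, $1 is the first parenthesized group, $2 the second, and so on. A minimal sketch, where the "last, first" input format and the `swapNames` helper are assumptions for illustration:

```java
class GroupReplaceDemo {
    // Swaps "last, first" into "first last" using numbered group references.
    static String swapNames(String input) {
        return input.replaceAll("(\\w+), (\\w+)", "$2 $1");
    }

    public static void main(String[] args) {
        System.out.println(swapNames("Friedl, Jeffrey")); // Jeffrey Friedl
        // A literal dollar sign in the replacement must itself be escaped.
        System.out.println("cost: 5".replaceAll("(\\d+)", "\\$$1")); // cost: $5
    }
}
```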
The closing example from this week uses a regular expression to remove all HTML tags from a source file. A nice way to write regular expressions is to start with an exact text and then slowly generalize it, i.e.:
Let’s start with the regular expression:
Ok, now let’s generalize it to be:
(less than followed by 5 word characters followed by a greater than)
Well, this can be further generalized to:
But really we should allow for white spaces, punctuation, and other characters inside the opening and closing brackets. Basically, we want to allow for any character that is not “>”!
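The progression can be tried out in Java. The four patterns below are assumptions that follow the descriptions above (the starting tag `<title>` in particular is a hypothetical choice of a five-letter tag), and `found` is a helper made up for the example.

```java
import java.util.regex.Pattern;

class TagRegexDemo {
    // True if the regex matches anywhere in the given HTML fragment.
    static boolean found(String regex, String html) {
        return Pattern.compile(regex).matcher(html).find();
    }

    public static void main(String[] args) {
        String[] regexes = {
            "<title>",   // exact text (hypothetical starting point)
            "<\\w{5}>",  // '<', exactly 5 word characters, '>'
            "<\\w+>",    // '<', one or more word characters, '>'
            "<[^>]*>"    // '<', any characters that are not '>', '>'
        };
        String[] samples = { "<title>", "<p>", "<a href=\"x.html\">", "</title>" };
        for (String regex : regexes) {
            for (String sample : samples) {
                System.out.println(regex + " vs " + sample + " : " + found(regex, sample));
            }
        }
    }
}
```

Only the last pattern matches all four samples, since attributes and closing tags contain characters (spaces, quotes, slashes) that are not word characters.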
The code to replace this match with nothing is then:
// A Regex to match anything in between <>
// Reads as: Match a "<"
// Match zero or more characters that are not ">"
// Match a ">"
String tagregex = "<[^>]*>";
Pattern p2 = Pattern.compile(tagregex);
Matcher m2 = p2.matcher(content);
count = 0;
// Just counting all the tags first
while (m2.find()) {
    //System.out.println(m2.group());
    count++;
}
// Replace any matches with nothing
content = m2.replaceAll("");
System.out.println("Removed " + count + " other tags.");
Related Perl / PHP Examples
Perl version of the vowel doubler:
#!/usr/bin/perl
undef $/;       # File "slurp" mode
$stuff = <>;    # read in the first file
# double any vowel occurrences
# g -- global
# i -- case insensitive (not used below, so only lowercase vowels are doubled)
$stuff =~ s/([aeiou]+)/$1$1/g;
print $stuff;
PHP:
Run it: http://www.shiffman.net/itp/classes/a2z/week02/voweldoubler.php
Source: http://www.shiffman.net/itp/classes/a2z/week02/voweldoubler.phps
In your example, \b([0-9A-Za-z]+)\s+\1\b, you say “The third part \1 says match whatever you matched that was enclosed inside the first set of parentheses.” However, where does this description come from? I’m just learning regex and I thought the backslash meant to treat the following character as a literal character.
Thanks!
Dale
Is there a way to run egrep from the command line that will return only the matched pattern and not the entire line of text that that pattern was found in?
Christin