Regular Expressions

Developed by Hardeep Singh
Copyright © Hardeep Singh, 2002
EMail seeingwithc@hotmail.com
Website seeingwithc.org
All rights reserved.
The code may not be used commercially without permission.
The code does not come with any warranties, explicit or implied.
The code cannot be distributed without this header.

Problem: We have a file (or a set of files) and we want to search for all email IDs present within them. The email IDs may be embedded within sentences and may be found anywhere in the file. We have a set of rules that specify what constitutes a valid email ID.
As we shall see, the solution will be quite generic and will solve a whole class of search / replace problems, not just this one.

Solution 1:

Usually, this is where a naive solution to the problem would be described in detail. Here, I will only sketch such an approach briefly and then move on.
One way to solve the problem is to write procedural logic that scans each character in the input file(s) and checks whether it can be part of an email ID. Basically, it searches for an '@' sign and then looks to its left and right to decide. However, any such solution quickly grows into, say, a 60-line piece of complex code. Let's try something different.

Solution 2:

Regular Expressions are used very widely for search / replace and other problems. Mastering even the basics of regular expressions will allow you to manipulate text with surprising ease.
Before starting, let us define the terms we will use often in this text. A set is an unordered collection of distinct elements sharing some common property; for example, the set of non-negative integers, {0,1,2,3,...}. A string is a finite sequence of characters.
A regular expression (or regexp, or pattern, or RE) is a text string that describes some (mathematical) set of strings. A RE r matches a string s if s is in the set of strings described by r.

Using a RE library, you can basically search, replace, parse and validate text; we will look at each of these uses later.

We have all used simplified forms of regular expressions like the file search patterns used by MS DOS, for example, dir *.doc. The use of regular expressions in computer science was made popular by a Unix based editor, 'ed'. However, Perl was the first language that provided integrated support for REs. They are now supported by almost all languages - JavaScript, Java, VB, C++ and C# either through language support or third party libraries. There are many Java regular expression libraries available - GNU Regexp for Java, Jakarta Regexp and Jakarta ORO to name a few. Support for Regular Expressions is also available in Java 1.4. Regular Expressions were originally studied as a part of Theory of Computation. Mathematically, they are equivalent to finite automata (more on this later).

I will not try to give a comprehensive account of their syntax. For that, I refer readers to Perl information sites, regular expression whitepapers on the net, or the documentation for your particular regular expression library. Indeed, regular expressions come in many different flavours, each differing from the others in syntax, capabilities and implementation method. This whitepaper is too small to cover regular expressions in their entirety. For that, I refer you to Mastering Regular Expressions by Jeffrey E.F. Friedl. The purpose of this document is to associate regular expressions with ordinary problems in software development and to encourage their use.

Regular Expressions have their own notation. Characters are things you can type. Operators are things in a RE that match one or more characters. You compose REs from operators, which in turn you specify using one or more characters. Theory of computation literature uses the operators listed in Table 1.

Table 1
Symbol Stands for...
Σ one letter from the whole alphabet
x* zero or more repetitions of x (where x is one symbol from Σ)
x+ one or more repetitions of x (where x is one symbol from Σ)
+ set union (a+b means any one of a or b)
- set difference (a-b means any one letter in a but not in b)
xy (juxtaposition) x followed by y

Suppose Σ={a,b}. The set Σ, containing the symbols a and b, is known as the alphabet. The set of all possible strings that can be written using symbols from this alphabet is called its Kleene closure. This is denoted by Σ*, informally meaning "the repetition of any symbol in Σ, zero or more times". The repetition of any symbol in Σ one or more times is written as Σ+ and is known as the set's positive closure.
So, using closures, we can see that strings containing only as can be denoted by a* or (Σ-b)*. Strings starting with a can likewise be denoted by aΣ*. All of these, that is Σ*, a* and (Σ-b)*, are regular expressions. At times, we draw diagrams for these, called finite automata. For Σ*, the diagram would be like the one shown in Figure 1.


Figure 1: Finite automaton for Σ*

We always start at the marked start state (q0 in this case) and keep moving along the arrows, consuming one input character each time we follow an arrow. We follow the arrow labelled with the current input character. If, when the input is exhausted, we end up at a state (a state is drawn as a circle) that has a double circle, the input is accepted (as matching). Here, on matching any character, the state remains q0, which is an accepting state.

Let's see another example. For aΣ*, the finite automaton is shown in Figure 2.

Figure 2: Finite automaton for aΣ*

This means we start at q0. On finding an a in the input string, we move to q1 and remain there as long as we keep finding input letters. Since we end at state q1, which is accepting, the string belongs to aΣ*. If the string does not start with an a, we can never reach q1, and hence the string is not accepted. More on this later.

However, language implementations use slightly modified notation. A small list appears in Table 2.

Table 2
Symbol Stands for...
. any single character
x* x, zero or more times
x+ x, one or more times
x? x once, or not at all (optional x)
x{n} x exactly n times
x{n,m} x, at least n but not more than m times
x|y either x or y
xy x followed by y
(x) x as capturing group (more later)
[abc] one of a or b or c, same as a|b|c
[^abc] any character except a, b or c
[a-zA-Z] a to z or A to Z (inclusive)

Not all libraries support all of the above operators. In addition, some notation from Unix has caught on, as shown in Table 3.

Table 3
Symbol Stands for...
^ The beginning of a line
$ The end of a line

A RE can either match a string fully, or it can match a substring. For example, the RE g..e matches both game and acknowledgement. If you want to match only strings like game, you can use ^g..e$ or use the GNU Regexp function isMatch instead of getMatch.
In case you want to use a RE operator as an ordinary character (for example, you want a * to appear in the string), you must precede it with a backslash. A backslash character ('\') quotes the next character, making it literal.

Often, RE libraries also provide operators for some widely used character classes. A character class is a set of characters that is allowed to appear in a specific place in texts matching a RE. For example, the set of all letters, [a-zA-Z] is a character class. RE libraries differ in the character class operators that they provide but those shown in Table 4 are provided by most.
Table 4
Symbol Matches Same as
\d Digit characters [0-9]
\D Non-digit characters [^0-9]
\w Word characters [a-zA-Z_0-9]
\W Non-word characters [^a-zA-Z_0-9]
\s Whitespace characters [ \f\n\r\t]
\S Non-whitespace characters [^ \f\n\r\t]
\b Word boundary  

So, instead of using [a-zA-Z_0-9]+ to match a word, simply use \w+. \b is a zero-length operator that checks for a word boundary (a transition between a \w character and a non-word character) without consuming anything. To match Hard in 'Hard steel is expensive.' but not in 'Hardworking people are in demand', use Hard\b. Some examples of the use of regular expressions are given in Table 5.
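As a quick check of \b, here is a minimal sketch, assuming the java.util.regex package from Java 1.4 (the same library used later in Listing 2); the class name is only for illustration:

import java.util.regex.*;

public class BoundaryDemo {
    public static void main(String args[]) {
        Pattern p = Pattern.compile("Hard\\b");
            // "Hard" followed by a word boundary
        System.out.println(p.matcher("Hard steel is expensive.").find());
            // prints true
        System.out.println(p.matcher("Hardworking people are in demand").find());
            // prints false: no word boundary right after "Hard"
    }
}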

Table 5
Description, followed by the RE in TOC notation and in language notation:
The set of strings containing only 0s and 1s that end in three consecutive 1s
    TOC notation: (0+1)*111         Language notation: (0|1)*111$ or (0|1)*1{3}$
The set of strings containing only 0s and 1s that have at least one 1
    TOC notation: 0*1(0+1)*         Language notation: (0|1)*1(0|1)*, 0*1(0|1)* or [01]*1[01]*
The set of strings containing only 0s and 1s that have at most one 1
    TOC notation: 0*+0*10*          Language notation: 0*|0*10* or 0*1?0*
String of any characters
    TOC notation: Σ*                Language notation: .*
The set of identifiers in Pascal
    TOC notation: {a,...,z,A,...,Z}({a,...,z,A,...,Z,0,...,9,_})*    Language notation: [a-zA-Z]\w*
A line of 80 characters
    TOC notation: ΣΣΣ ... Σ (80 times)    Language notation: .{80}
A string of 1s, having at least one 1
    TOC notation: 1+                Language notation: 1+
A string of letters not containing any vowel
    TOC notation: (Σ-{a,e,i,o,u})*  Language notation: [^aeiou]*

Now that we have seen what regular expressions are, let us see what they can be used for. There are four possible uses of regular expressions:

  1. Searching:
    Let us say we want to search for all numbers within a character string. An example of the character string is:

    abcdaef12345fghfgh234eioutsrkplmn

    In this string, 12345 and 234 are to be extracted.
    To solve this problem, we need to search within the string using the regular expression [0123456789]+ or [0-9]+. I will give a code sample on how to do such searches using REs in Java later, while discussing the solution to the problem given at the beginning of this text.
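    In the meantime, here is a minimal sketch of such a search, assuming the java.util.regex package built into Java 1.4 (the same library used in Listing 2 later); the class name is only for illustration:

    import java.util.regex.*;

    public class NumberSearch {
        public static void main(String args[]) {
            String input = "abcdaef12345fghfgh234eioutsrkplmn";
            Matcher m = Pattern.compile("[0-9]+").matcher(input);
            while (m.find())                  // find the next run of digits
                System.out.println(m.group()); // prints 12345, then 234
        }
    }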

  2. Replacing:
    Let us say we have some text in which some numbers are wrongly written as .123 instead of 0.123, and we want to correct these. We can use the regular expression ([^0-9])(\.[0-9]+) (or, in short, (\D)(\.\d+)). The meaning of this RE is explained in Table 6.
    Table 6
    Part of RE Significance
    [^0-9] starts with a non-digit character
    (...) work as capturing group; explained later
    \. has a dot. The backslash signifies that we are using the dot as a literal character; it does not stand for any character, as an unescaped dot would in a RE
    [0-9]+ has any number of digits afterwards (at least one is required)
    The part within parentheses has a special meaning within regular expressions. It keeps track of the indices of the substring that matched a given group. That is, once the regular expression has been matched, the part of the input that matched the parenthesised part of the expression is remembered (stored in memory) and is accessible through variables or function calls (depending on the library in use). In Perl, this can be accessed through the $1 variable for the first parenthesised group, $2 for the next pair of parentheses and so on. $& is always the completely matched string, the part that matches the whole expression. In Java, there are functions to access these groups, depending on the library used. So, using the RE library, we now have to replace the string found with $10$2, where $1 contains the non-digit character found and $2 the dot and the digits. (With some systems, $10$2 might be interpreted to mean the text of the tenth capture group, followed by the second. In such systems, you might have to write this as $1 . "0" . $2.)
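    As a minimal sketch, the same replacement in Java 1.4's java.util.regex looks like this (the sample string is only illustrative; because the pattern has no group 10, Java resolves $10 here as group 1 followed by a literal 0):

    String text = "The result was .123, not 0.5 or 1.75.";
    String fixed = text.replaceAll("(\\D)(\\.\\d+)", "$10$2");
                   // fixed is "The result was 0.123, not 0.5 or 1.75."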
    Since this is a widely used application of regular expressions, let's do another example.
    We have a file pathname something like "C:\Windows\desktop\abc.txt" (not including the quotes) and we need to extract only the filename, "abc.txt". Using procedural logic, we would have to scan the string from the end, looking for a backslash and then create a new string copying from that index to the end. Instead, using REs, it boils down to just a single line:
    (here I use C# RE syntax; it is similar for other languages)

     // requires: using System.Text.RegularExpressions;
     string text = @"C:\Windows\desktop\abc.txt";
     string pattern = @"^.*\\";
     string result = Regex.Replace(text, pattern, "");

    The RE ^.*\\ means "start at the beginning and keep matching until the last '\' (the double backslash means that '\' is being used as a literal character, not as regular expression notation)". Since the * operator is greedy, it absorbs as many characters as possible. Greediness means that the + and * operators try to match as many characters as possible. For example, consider the RE .*er applied to beriberi. The .* part could match b or berib; because the * operator is greedy, it will match berib.
    Hence, the backslash that matches will always be the last one. The Replace statement then causes the matched part to be replaced by an empty string, effectively deleting it from the string. Finally, the string 'result' will contain the filename 'abc.txt'.

  3. Parsing:
    Parsing means taking apart in order to understand something. Let us say, we have a typical URL:
    http :// www.xyz.com /doc/public /xxx.html
    -1-      -----2----- -----3----- ----4----
    In this, -1- is a protocol, -2- is the name of a server, -3- is a path and -4- is the name of a document. Suppose we want to write a program that takes a URL and returns the protocol used, the DNS name of the server, the directory and the document name. We can do this using a RE as:

    ^(ftp|http|file)://([^/]+)(/.*)?(/.*)

    It says: start at the beginning, look for a protocol (one of ftp, http or file, as denoted by ftp|http|file), look for ://, look for the DNS name (as in ([^/]+)), then for an optional path (as denoted by the question mark after (/.*)?) and, thereafter, for a document name (/.*). Convince yourself that the regular expression will do what is required and remember the four required values as $1, $2, $3 and $4 (in Perl).

    Let us say, we enter the URL given above. The values returned are:
    $1 = http
    $2 = www.xyz.com
    $3 = /doc/public
    $4 = /xxx.html
    If we enter a URL that does not have a path, like,
    http://www.xyz.com/a.html

    the RE still works (because of the question mark) and returns:
    $1 = http
    $2 = www.xyz.com
    $3 = null
    $4 = /a.html

    Or, if the document is also not given, as in,

    http://www.xyz.com/

    then $4 contains just the slash. Note that this regular expression works only on VALID inputs. If the URL entered is invalid, it churns out invalid results (a case of GIGO). I leave it to the reader as an exercise to write a RE that validates and parses the URL at the same time.
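    Here is a minimal sketch of this parsing in Java using java.util.regex (the class name is only for illustration; the group numbers correspond to $1 through $4 above):

    import java.util.regex.*;

    public class UrlParts {
        public static void main(String args[]) {
            Pattern p = Pattern.compile("^(ftp|http|file)://([^/]+)(/.*)?(/.*)");
            Matcher m = p.matcher("http://www.xyz.com/doc/public/xxx.html");
            if (m.find()) {
                System.out.println(m.group(1));  // http
                System.out.println(m.group(2));  // www.xyz.com
                System.out.println(m.group(3));  // /doc/public (null if no path)
                System.out.println(m.group(4));  // /xxx.html
            }
        }
    }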

  4. Validation:
    We discuss the age-old problem of date validation. A date is something like 23/4/2002 or 23-04-2002. For the day part, ([0-3]{0,1}[0-9]) would suffice. That is, an optional 0/1/2/3 followed by a digit. Instead of [0-3]{0,1}, we could also have used [0-3]?. Now, say, we want to allow both dash (-) separated and slash (/) separated dates. So, we write the next part as /([01]{0,1}[0-9])/ or \-([01]{0,1}[0-9])\- (the backslash indicates that the '-' is being used literally, not as RE notation). Please note that I have required the same separator before and after the month instead of writing [/\-] (that is, slash or dash) at each position. This is to ensure that dates like 21-3/97 are not accepted. The year part is encoded similarly. The full expression is given in the procedure below. Instead of using (...), we use (?:...) when we do not want the part matched within the parentheses to be remembered. For example, instead of writing housecat|housekeeper, we can say house(?:cat|keeper). Here, the parentheses are only for grouping, not for capturing. The procedure is given in Listing 1.
    /* Listing 1. This uses the GNU Regexp library to parse dates.
       Requires: import gnu.regexp.*; and import java.util.*; (for Calendar and Date) */
    public Date parse(String date)
    {
            String strRe=new String("([0-3]{0,1}[0-9])" + 
                                    "(?:(?:/([01]{0,1}[0-9])/)" + 
                                    "|(?:\\-([01]{0,1}[0-9])\\-))" + 
                                    "((?:19|20)[0-9]{2})");
                                         // the regular expression
            RE exp=null;
            try {
                    exp = new RE(strRe); // create a regular expression
            }
            catch (REException e) {
            	// cannot happen for this regexp
            }
            REMatch rem=exp.getMatch(date);
                                         // see if matches
            if (rem!=null) 	{
                    int dd,mm,yyyy;
                    String tempMm;
                    try {
                            dd=Integer.parseInt(rem.toString(1));
                                         // get the first capture group value
                            tempMm=rem.toString(2);
                                         // second and third
                                         // only one of second and third is
                                         // valid, and that is the month
                            if (tempMm.equals(""))
                                    mm=Integer.parseInt(rem.toString(3));
                            else
                                    mm=Integer.parseInt(tempMm);
                                         // fourth
                            yyyy=Integer.parseInt(rem.toString(4));
                    }
                    catch (NumberFormatException nfe) {
                            return null;
                    }
                    Calendar cal = new GregorianCalendar();
                                         // create the date object and return
                    cal.set(yyyy,mm-1,dd);
                    return cal.getTime();
            }
            else
                    return null;
    }
    

Note that certain improvements are in order. For example, the RE allows the day-of-month to be 39. Although such values are rejected later during processing, we can change the day part to [0-2]{0,1}[0-9]|30|31. Similarly, the other parts can be made stricter; for example, the month part can be changed to exclude values like 00 or 19. These changes are left to the reader.
This procedure runs only about five times slower than an equivalent one that does not use regular expressions. Moreover, as newer approaches to regular expressions are developed, this difference should reduce even further.

Having seen regular expressions being used, let us turn our attention to how they work. That is, how is a RE engine (as part of a RE library) able to accept or reject strings? There are two approaches:

  1. DFA approach
  2. Procedural approach
Let us understand these.

DFA approach:
Somewhere near the beginning of this text, I showed you finite automata diagrams. For every RE, we can construct a Deterministic Finite Automaton (DFA) and vice versa. A DFA is nothing but a kind of program for finding out whether the RE matches some text or not. This kind of engine first constructs a Non-deterministic Finite Automaton (NFA) equivalent to the RE. From this NFA, it constructs an equivalent DFA and executes the DFA on the input string.
Let us work through an example. We want to detect all strings containing only as and bs and ending with abb. We present the regular expression (a|b)*abb$ to the engine. It constructs an NFA as shown in Figure 3.

Figure 3: Non-deterministic finite automaton

However, the computer cannot execute this directly because, on encountering an a at q0, it cannot decide between moving to q1 and staying at q0. Hence, the computer translates the NFA into the DFA shown in Figure 4.


Figure 4: Deterministic finite automaton

This can be executed easily. The way this is done for the input string abaababb is shown in Table 7.

Table 7
Input absorbed State
- q0
a q1
b q2
a q1
a q1
b q2
a q1
b q2
b q3
We have a match whenever the state reaches q3.
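To make "executing" a DFA concrete, here is a minimal sketch in Java of the automaton for (a|b)*abb written directly as a transition table; the state numbering is assumed to follow Figure 4 and the trace in Table 7, and the input is assumed to contain only as and bs:

public class AbbDfa {
    // delta[state][symbol]: symbol 0 stands for 'a', 1 for 'b'
    static final int[][] delta = {
        {1, 0},   // q0: a -> q1, b -> q0
        {1, 2},   // q1: a -> q1, b -> q2
        {1, 3},   // q2: a -> q1, b -> q3
        {1, 0}    // q3: a -> q1, b -> q0
    };

    static boolean accepts(String s) {
        int state = 0;                              // start at q0
        for (int i = 0; i < s.length(); i++)
            state = delta[state][s.charAt(i) == 'a' ? 0 : 1];
        return state == 3;                          // q3 is the accepting state
    }

    public static void main(String args[]) {
        System.out.println(accepts("abaababb"));    // true
        System.out.println(accepts("abab"));        // false
    }
}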

Procedural approach:
This approach uses a procedure to decide whether the input matches or not. Let us say it has to match a string against (a|b)*abb. First, it matches the first character (say, an a) with the expression; (a|b)* accepts it. Matching in this way, it goes to the end of the input string, with (a|b)* accepting all the as and bs. Now, it sees the abb part of the RE and decides to back off in the input string until it finds an a, then looks for bs after it, and so on. Sometimes this approach is also called the NFA approach (because it is 'opposed' to the DFA approach), but since it does not actually use an NFA, I do not like to call it by that name.
The advantage of the DFA approach is that it is very fast. Most of the time, the speed difference is small. However, if the RE is something like (a|b+)* (that is, the RE contains nested quantifiers, which we should avoid anyway), the difference can be huge, especially if the pattern does not match (because all possibilities will be tried before giving up). Also, if a RE has to be matched many times (for example, when searching for matches in a set of files), the saving can be significant.
However, the procedural approach is easier to implement and can provide more features than the DFA approach. For example, it is not easy to implement capture groups (using parentheses within the RE, as shown) in the DFA approach. The Unix 'awk' tool uses the DFA approach while Perl uses the procedural approach. However, these days most tools use mixed logic to do the matching. They might use the DFA approach to check whether a match exists, and then use the procedural approach to get the values of the capture groups. We are also seeing implementations of capture groups in the DFA approach.
However, there is one more key difference between the two approaches. Sometimes, the same regular expression can be rearranged to produce different results with a procedural engine. Consider the RE t(e|es). and the input string test. A DFA based engine will always give the matched string as test, but the procedural engine will beg to differ. It will see the (e|es) part and consider the e first. That is, it will logically look at the RE first as te. and will match tes, not test. Since there is a match, the engine will never backtrack and consider the other option es. Had there been no overall match, it would have considered matching es, seen the RE as tes. and matched test. If the RE is rearranged to t(es|e)., the procedural engine will also match test. The DFA engine always returns the longest match possible. Period.
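You can observe this ordering effect with Java's java.util.regex package, which is a backtracking (procedural) engine; a minimal sketch:

import java.util.regex.*;

public class Ordering {
    public static void main(String args[]) {
        Matcher m1 = Pattern.compile("t(e|es).").matcher("test");
        if (m1.find())
            System.out.println(m1.group());  // prints "tes": e is tried first
        Matcher m2 = Pattern.compile("t(es|e).").matcher("test");
        if (m2.find())
            System.out.println(m2.group());  // prints "test": es is tried first
    }
}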

Now, we come to the solution to the problem posed at the beginning. I will give a Java program as the answer; it uses the RE library built into JDK 1.4. However, the program can be made to run on any Java version using other libraries, with minimal changes. The RE we use is something like:


[a-z0-9\.\-\_]+@([a-z0-9\-\_]+\.)+(com|net|[a-z]{2})

Since we are using a case-insensitive match, we do not need to add A-Z. Also, in the actual program, we include all the other possible top-level domain names ORed with com and net. To allow country codes like IN for India, US for USA, etc., we use [a-z]{2}. Convince yourself that the rest of the expression is correct. The program is shown in Listing 2.
// Listing 2. Uses RE to detect email IDs in files
// Requires Java 1.4 or above

import java.util.regex.*;
import java.io.*;

public class Emails {
	
	public static void main(String args[]) {
		if (args.length < 1) {
			System.out.println("USAGE: java Emails filename [-comma]");
			System.exit(1);
		}

		String strRe=new String("[a-z0-9.\\-_]+@([a-z0-9\\-_]+\\.)+(" +
					"com|net|org|edu|int|mil|gov|arpa|biz|" + 
					"aero|name|coop|info|pro|museum|tv|([a-z]{2}))");
				// the regular expression
				// 'tv' can be omitted because it is 
				// covered under [a-z]{2}
					
		Pattern p = Pattern.compile(strRe,Pattern.CASE_INSENSITIVE | 
		                            Pattern.UNICODE_CASE | Pattern.MULTILINE);
				// compile it
					
		String content=null;
				// read the input file into content
		boolean comma=false;
		try {
			FileInputStream fin=new FileInputStream(args[0]);
			StringBuffer sb=new StringBuffer();
			int ch;
			while ((ch=fin.read())!=-1)
				sb.append((char)ch);
			fin.close();
			content=new String(sb);
		}
		catch (Exception e) {
			System.out.println("Cant read file.");
			System.exit(1);
		}
    
		if (args.length>=2) {
			if (args[1].equalsIgnoreCase("-comma"))
				comma=true;
		}
		
		boolean theEnd = false, printed = false;
		Matcher m = p.matcher(content);
				// get the results and print
		
		while (!theEnd) {
			theEnd = !m.find();
			if (!theEnd) {
				if (comma)
					if (!printed)
						System.out.print(content.substring(
						                 m.start(),m.end()));
					else
						System.out.print(", "+content.substring(
						                      m.start(),m.end()));
				else
					System.out.println(content.substring(
					                              m.start(),m.end()));
				printed=true;
			}
		}
	}
}

For those who do not have access to Java or RE library, I also present an Awk program for the same purpose. It can be invoked as awk -f email.awk <filename> where email.awk contains the following program and <filename> is the name of file in which to search for email IDs.

{
        # the RE; note that in Awk, adjacent strings are concatenated
        # by juxtaposition (there is no '+' string operator)
        r = "[a-z0-9\\.\\-\\_]+@(([a-z0-9\\-\\_])+\\.)+(com|net|org|edu|int|mil|gov" \
            "|arpa|biz|aero|name|coop|info|pro|museum|tv|[a-z]{2})";
        a = $0;
        while (length(a) != 0) {
                # uppercase both the text and the RE to get a
                # case insensitive match
                if (match(toupper(a), toupper(r))) {
                        print substr(a, RSTART, RLENGTH);
                }
                else
                        break;
                # continue searching after the match just found
                a = substr(a, RSTART + RLENGTH, length(a));
        }
}

Now, we come to irregular expressions. Throughout the text, we have been saying "regular expressions". You might have been wondering whether there are any irregular expressions as well. Well, there are. As an example, consider the set:

{a^n b^n : n > 0}, where a^n means a repeated n times
That is, the set of all strings like {ab, aabb, aaabbb, ...} (having the same number of bs as the number of as that precede them). This set is irregular in the sense that its members cannot be described by a single RE. However, using a sequence of RE operations, we can still validate members of this set. While discussing the solution, I will uncover one more feature of REs.

Backreferences are used within REs to refer to "what has already been matched". A backreference matches a specified preceding group. The backreference operator is written as \digit anywhere after the end of a RE's digit-th group. However, backreferences are not supported by all RE libraries. The notation used is \1 for the 1st capture group, \2 for the second and so on. Say we want to match strings of the form x@x, where x is a string. That is, aaa@aaa is valid but not aaa@bbb. The RE for this is (.*)@\1. That is, anything before the @ is stored as the first capture group (accessible in Perl through $1, as discussed) and the same thing must be repeated after the @ (indicated by \1). If the group matches a substring, the backreference matches an identical substring. If the group matches more than once (as it might if followed by, for example, a repetition operator), then the backreference matches the substring the group last matched. We will use this feature to accept (or reject) strings of the form a^n b^n. However, we cannot use backreferences directly, as in (a+)\1, because the second half contains bs rather than the as of the first half. Hence, we must somehow transform the input so that it can be checked. Here, we make this transformation:

  1. Replace an occurrence of ab in the input with axa.
  2. Replace all bs with as.
Hence, on being given aabb, after the first transformation it becomes aaxab and, after the second, aaxaa. Both these transformations are done through REs. Now, we can simply check the result using the RE ^(a+)x\1$. A complete JavaScript function to check such strings is given in Listing 4.
// Listing 4
// JavaScript function to test whether a string is of the
// form a^nb^n (n>0). This set, containing strings having any
// number of a's followed by the same number of b's, cannot be
// represented by a single regular expression. Here we try to
// do the check through two regular expression replacements and
// a final match, without any other programming logic
function doValidate(id)
{
        var reg1,reg2,reg3;
        reg1 = /ab/; 
        reg2 = /b/g;
        reg3 = /^(a+)x\1$/;

        if (reg3.test(id)) {            // this test is important
                alert("Invalid.");      // if the expression is of the
                return;                 // form a^n.x.a^n right from
        }                               // the start, then it will match.

        id = id.replace(reg1,"axa");

        id = id.replace(reg2,"a");
        // now the expression is of the form a^n.x.a^n, which is regular

        if (reg3.test(id))
                alert("Valid.");
        else
                alert("Invalid.");
}

Conclusion
We saw that regular expressions provide us with an easy way to manipulate text. The operators +, * etc. can be employed to write REs with complex matching capabilities. REs can be used for searching, replacing, parsing and validating strings. REs can be DFA or procedure based, each of which has its own advantages and disadvantages. While a DFA based engine is faster, procedural engines provide more flexibility. REs are supported by many tools and languages. Even irregular matching requirements can be handled easily using multiple REs.

Happy Regular Expression(ing)...