### Formal Languages, part 2: Regular Expressions and Regular Languages

Before I start in on this entry, I need to give credit where credit is due and note that my main reference source for these blog posts on formal languages is my grad school textbook,Ā *An Introduction to Formal Languages and Automata, Third Edition* by Peter Linz (the 5th edition is linked). If you’re out there, Peter, thanks for writing an awesome intro to this nerdy topic, and congratulations on having a great first name. š

So let’s talk about regular expressions. Again, we’re not talking about the Perl version with lazy matching and look-ahead and look-behind and all that magic; we’re discussingĀ austere, formalized regular expressions. The idea is that a regular expression matches a set of strings, and those strings make up aĀ **regular language**. (This is jumping the gun a bit because we actually define regular languages using **finite automata**, but we’ll come back to that later.)Ā For formality’s sake, we first note that the empty set ā
and the empty string Ī» themselves are regular expressions; the former is, of course, a set of zero sentences, while the latter is a set of one sentence, namely the empty string. In addition, any memberĀ *a* in a given alphabet Ī£ is also a regular expression of a single sentence one symbol long. We then add on concatenation (*r _{1}*Ā·

*r*, or more simply,

_{2}*r*), alternation (

_{1}r_{2}*r*+

_{1}*r*, which is expressed in Unix and Perl REs as

_{2}*r*|

_{1}*r*, and means “

_{2}*r*or

_{1}*r*“) and star-closure (

_{2}*r**, i.e. zero or moreĀ

*r*s in a row), where theĀ

*r*s are composed of theĀ

*primitive*regular expressionsĀ ā ,Ā Ī», and allĀ

*a*Ā ā Ī£. (We almost always ignore the ā and Ī», however, and just talk about combinations of symbols in the alphabet Ī£.) We can also group sub-expressions with parentheses. So this way, we can say things like “a language with an even number ofĀ

*a*s followed by an odd number ofĀ

*b*s”, which would be expressed as:

*r* = (*aa*)*(*bb*)**b*

To break that down, we first see thatĀ (*aa*)* represents any number of *aa*s: 0, 1, 2, etc., and so gives us an even number of *a*s. The second part, (*bb*)**b*, indicates that we again have an even number of *b*s, but then there is one extra *b* at the end, giving us an odd number. We can give many examples of regular expressions: “all bit stringsĀ (i.e. strings of 0s and 1s) which has at least one pair of consecutive 0s”, “all strings on an alphabet such that the length is divisible by 3”, “all bit strings that, when interpreted as a binary number, is greater than or equal to 39”, and so on. So it seems like we can define quite a few complicated languages using just this simple way of expressing patterns! However, remember the example from part 1:

/^1?$|^(11+?)\1+$/

This isn’t a true regular expression, as we noted – you can’t do things in regular expressions like “lazy matching” or “back-referencing”. All we can say is “this followed by that”, “this or that” and “zero or more thises”. But maybe it’s possible that thereĀ *is* a regular expression that does what the above does, namely match all composite numbers expressed in unary notation. Try if you’d like, but you’ll soon see it’s futile. In another post, I’ll explain the goofily-namedĀ **pumping lemma** which is used to prove that a particular grammar does not represent a regular language, and thus cannot be represented by a regular expression. We’ll then use the pumping lemma to prove that we can’t construct a regular expression for this language.

I think that’s enough for now. Next time, we’ll discuss finite automata and how they’re used to truly define regular languages. As a preview, we’ll see that there are two types:Ā *deterministic* automata which require that you always know what to do next, andĀ *non-deterministic* automata which allows “guesses” as to how to proceed. The magic part, as we’ll see, is that they are equivalent! But you’ll have to come back later to see what I’m talking about. š