The Lutheran Geek

The life and times of a WoW-playing, Java-programming dude in Chicago

Wednesday, February 25, 2015

Formal Languages, part 2: Regular Expressions and Regular Languages

Before I start in on this entry, I need to give credit where credit is due and note that my main reference source for these blog posts on formal languages is my grad school textbook,Ā An Introduction to Formal Languages and Automata, Third Edition by Peter Linz (the 5th edition is linked). If you’re out there, Peter, thanks for writing an awesome intro to this nerdy topic, and congratulations on having a great first name. šŸ™‚

So let’s talk about regular expressions. Again, we’re not talking about the Perl version with lazy matching and look-ahead and look-behind and all that magic; we’re discussingĀ austere, formalized regular expressions. The idea is that a regular expression matches a set of strings, and those strings make up aĀ regular language. (This is jumping the gun a bit because we actually define regular languages using finite automata, but we’ll come back to that later.)Ā For formality’s sake, we first note that the empty set āˆ… and the empty string Ī» themselves are regular expressions; the former is, of course, a set of zero sentences, while the latter is a set of one sentence, namely the empty string. In addition, any memberĀ a in a given alphabet Ī£ is also a regular expression of a single sentence one symbol long. We then add on concatenation (r1Ā·r2, or more simply, r1r2), alternation (r1 + r2, which is expressed in Unix and Perl REs as r1 | r2, and means “r1 or r2“) and star-closure (r*, i.e. zero or moreĀ rs in a row), where theĀ rs are composed of theĀ primitive regular expressionsĀ āˆ…,Ā Ī», and allĀ aĀ āˆˆ Ī£. (We almost always ignore the āˆ… and Ī», however, and just talk about combinations of symbols in the alphabet Ī£.) We can also group sub-expressions with parentheses. So this way, we can say things like “a language with an even number ofĀ as followed by an odd number ofĀ bs”, which would be expressed as:

r = (aa)*(bb)*b

To break that down, we first see thatĀ (aa)* represents any number of aas: 0, 1, 2, etc., and so gives us an even number of as. The second part, (bb)*b, indicates that we again have an even number of bs, but then there is one extra b at the end, giving us an odd number. We can give many examples of regular expressions: “all bit stringsĀ (i.e. strings of 0s and 1s) which has at least one pair of consecutive 0s”, “all strings on an alphabet such that the length is divisible by 3”, “all bit strings that, when interpreted as a binary number, is greater than or equal to 39”, and so on. So it seems like we can define quite a few complicated languages using just this simple way of expressing patterns! However, remember the example from part 1:

/^1?$|^(11+?)\1+$/

This isn’t a true regular expression, as we noted – you can’t do things in regular expressions like “lazy matching” or “back-referencing”. All we can say is “this followed by that”, “this or that” and “zero or more thises”. But maybe it’s possible that thereĀ is a regular expression that does what the above does, namely match all composite numbers expressed in unary notation. Try if you’d like, but you’ll soon see it’s futile. In another post, I’ll explain the goofily-namedĀ pumping lemma which is used to prove that a particular grammar does not represent a regular language, and thus cannot be represented by a regular expression. We’ll then use the pumping lemma to prove that we can’t construct a regular expression for this language.

I think that’s enough for now. Next time, we’ll discuss finite automata and how they’re used to truly define regular languages. As a preview, we’ll see that there are two types:Ā deterministic automata which require that you always know what to do next, andĀ non-deterministic automata which allows “guesses” as to how to proceed. The magic part, as we’ll see, is that they are equivalent! But you’ll have to come back later to see what I’m talking about. šŸ™‚

posted by Peter at 9:34 pm  

1 Comment »

  1. Barton Winders

    The Lutheran Geek

    Trackback by Wesley Pignataro — March 25, 2019 @ 11:37 am

RSS feed for comments on this post. TrackBack URI

Leave a comment

You must be logged in to post a comment.

Powered by WordPress