The Lutheran Geek

The life and times of a WoW-playing, Java-programming dude in Chicago

Wednesday, February 25, 2015

Formal Languages, part 2: Regular Expressions and Regular Languages

Before I start in on this entry, I need to give credit where credit is due and note that my main reference source for these blog posts on formal languages is my grad school textbook,Ā An Introduction to Formal Languages and Automata, Third Edition by Peter Linz (the 5th edition is linked). If you’re out there, Peter, thanks for writing an awesome intro to this nerdy topic, and congratulations on having a great first name. šŸ™‚

So let’s talk about regular expressions. Again, we’re not talking about the Perl version with lazy matching and look-ahead and look-behind and all that magic; we’re discussingĀ austere, formalized regular expressions. The idea is that a regular expression matches a set of strings, and those strings make up aĀ regular language. (This is jumping the gun a bit because we actually define regular languages using finite automata, but we’ll come back to that later.)Ā For formality’s sake, we first note that the empty set āˆ… and the empty string Ī» themselves are regular expressions; the former is, of course, a set of zero sentences, while the latter is a set of one sentence, namely the empty string. In addition, any memberĀ a in a given alphabet Ī£ is also a regular expression of a single sentence one symbol long. We then add on concatenation (r1Ā·r2, or more simply, r1r2), alternation (r1 + r2, which is expressed in Unix and Perl REs as r1 | r2, and means “r1 or r2“) and star-closure (r*, i.e. zero or moreĀ rs in a row), where theĀ rs are composed of theĀ primitive regular expressionsĀ āˆ…,Ā Ī», and allĀ aĀ āˆˆ Ī£. (We almost always ignore the āˆ… and Ī», however, and just talk about combinations of symbols in the alphabet Ī£.) We can also group sub-expressions with parentheses. So this way, we can say things like “a language with an even number ofĀ as followed by an odd number ofĀ bs”, which would be expressed as:

r = (aa)*(bb)*b

To break that down, we first see thatĀ (aa)* represents any number of aas: 0, 1, 2, etc., and so gives us an even number of as. The second part, (bb)*b, indicates that we again have an even number of bs, but then there is one extra b at the end, giving us an odd number. We can give many examples of regular expressions: “all bit stringsĀ (i.e. strings of 0s and 1s) which has at least one pair of consecutive 0s”, “all strings on an alphabet such that the length is divisible by 3”, “all bit strings that, when interpreted as a binary number, is greater than or equal to 39”, and so on. So it seems like we can define quite a few complicated languages using just this simple way of expressing patterns! However, remember the example from part 1:

/^1?$|^(11+?)\1+$/

This isn’t a true regular expression, as we noted – you can’t do things in regular expressions like “lazy matching” or “back-referencing”. All we can say is “this followed by that”, “this or that” and “zero or more thises”. But maybe it’s possible that thereĀ is a regular expression that does what the above does, namely match all composite numbers expressed in unary notation. Try if you’d like, but you’ll soon see it’s futile. In another post, I’ll explain the goofily-namedĀ pumping lemma which is used to prove that a particular grammar does not represent a regular language, and thus cannot be represented by a regular expression. We’ll then use the pumping lemma to prove that we can’t construct a regular expression for this language.

I think that’s enough for now. Next time, we’ll discuss finite automata and how they’re used to truly define regular languages. As a preview, we’ll see that there are two types:Ā deterministic automata which require that you always know what to do next, andĀ non-deterministic automata which allows “guesses” as to how to proceed. The magic part, as we’ll see, is that they are equivalent! But you’ll have to come back later to see what I’m talking about. šŸ™‚

posted by Peter at 9:34 pm  

Sunday, February 15, 2015

Formal Languages, Part 1: Basic Concepts

Last week, there was a linkĀ posted on Hacker News that described a regular expression that only matches non-prime numbers (expressed as a string of “1”s, i.e. in so-called “unary” notation):

/^1?$|^(11+?)\1+$/

Commenters noted that this is not aĀ real regular expression. But what does that mean? Anyone with experience in Perl regular expressions can see that this clearly will work just fine; the linked article does a good job of breaking down how this goes about matching (or not matching) a given string of 1s. You see, the history of Unix has sort of muddied the term “regular expression” to the point that it means something in practice thatĀ is broader than the theory. Originally, the old Unix command line tool grep was designed with classical “regular expressions” in mind. However, as is well-known, classical regular expressions are very limited, and so a separate tool called egrep was written to accept so-called “extended regular expressions”, and eventually egrep‘s notion of an extended RE is what we now call a “regular expression”.

So what is a “classical” regular expression? To explore that question, we need to look into the field called “formal languages”. Despite the name, the field has little to do with programming languages or spoken languages, although it is used in theoretical studies of these kinds of language. Formal language study is the study of patterns, in a way. It asks: what is the simplest pattern which can describeĀ this (usually infinite) set of strings? So to understand what a regular expression is, we need to set down some definitions first that will allow us to talk about formal languages in general. Once we do that, we can talk about the family of so-called “regular languages” that regular expressions encompass, and go beyond that to define more powerful languages.

Basic concepts of formal languages

AĀ language is just a set ofĀ strings that are constructed from anĀ alphabet. An alphabet is simply a finite set of symbols such as ‘a’, ‘b’, ‘c’, etc. Traditionally, an alphabet is denoted with the symbol Σ. Strings are composed of zero or more symbols from the alphabet – the so-calledĀ empty string is traditionally denoted byĀ λ.

Since strings can contain many consecutive copies of a particular symbol, we can use a shorthand to represent repetition:

an = aaaaaa… (n times)

For n = 0, we define

a0 = λ

When starting with an alphabet Σ, it’s useful to talk about the possible strings it can make. Σ* is the set of all strings consisting of zero or more symbols from Σ. So say that our alphabet was very simple: it just consisted of {a}. In this case,

Σ* = { λ, a, aa, aaa, aaaa, … }

or, more succintly,

Σ* = { an : n ≥ 0 }

Σ* always contains λ; if we want to exclude it, we define

Σ+ = Σ* – { λ }

A language, then, is just a subset of Σ* – perhaps in a way we can define succinctly, perhaps not. A string that is a member of a language is called a sentence. Remember, these are formal, abstract terms, so don’t get too hung-up on the real meaning of “language” or “sentence” here.

Finally, we have the concept of “grammars”. Again, this isn’t like a grammar you’re used to, although it’s quite close! Sentences in various languages can only be constructed in certain ways; some (like English) are subject-verb-object, some (like Japanese) are subject-object-verb; even others use different combinations. So a basic grammar rule in English is expressed as:

[sentence] ⇒ [subject] [verb] [object]

Subjects and objects can be further defined as agglomerations of nouns and articles and such. In the same way, we define grammars with our languages and alphabets:

G = (V, T, S, P)

G is a grammar described as a combination of three sets and one element:

  • V is defined as a set of variables; as few as one. These are usually denoted with capital letters such as S, A, B, etc.
  • T is defined as a set of terminal symbols. This is almost always the alphabet Σ that we are already familiar with
  • S is a single member of V, referred to as the start variable; by convention, it’s usually just the letter S. This variable is assumed to be, naturally, the start of any derivation using our grammar.
  • P is a set of productions similar to our grammar rule above. Productions can contain any number of variables and/or terminal symbols on either side of the arrow, though we specify that the left side contains at least one symbol, while the right side can be empty (i.e. λ).

A language, thus, can be defined as all the sentences such that there is some chain of productions in P called a derivation that produces the sentences – we denote this language as L(G), or the language derived from the grammar G.

To wrap up, here’s a simple grammar:

G = ({S}, {a, b}, S, P)

where P is given by the two productions:

S ⇒ aSb
S ⇒ λ

We can derive the sentence “aabb” as so:

S ⇒ aSb ⇒ aaSbb ⇒ aabb

By applying the first rule 0 or more times, followed by the second rule, it’s pretty easy to see that this grammar defines a fairly simple language:

L(G) = { anbn : n ≥ 0 }

So now we can talk about formal languages. Next time, we’ll actually get into regular languages, although along the way we will have to detour and talk about “automata”. Robots? Well, not really. More like a robot that defines a language. šŸ™‚

posted by Peter at 8:22 pm  

Sunday, February 8, 2015

An engineer’s devotional

Back in 1998, I was a volunteer at Christ Lutheran Church in Albany Park. I worked there as a youth worker for a year through the Lutheran Volunteer Corps. Being right out of college, it was a great way to spend a year to start finding my way in the post-adult world, as well as to give back a little before I got consumed in the world of work. This past week, the pastor, Tom Terrell, contacted me and others asking us if we were willing to contribute entries for his congregation’s Lenten devotional. It was really great to hear from him, and considering what a crazy week this has been personally (more about that some other time), I accepted the opportunity. Here is my humble contribution.


O LORD, my heart is not lifted up,
my eyes are not raised too high;
I do not occupy myself with things
too great and too marvelous for me.
But I have calmed and quieted my soul,
like a weaned child with its mother;
my soul is like the weaned child that is with me.
O Israel, hope in the LORD
from this time and forevermore.

— Psalm 131 (NRSV)

These days, I work as a software engineer. I’m fortunate enough to work with a group of people who are very down to earth and wonderful people, but I’ve seen some of the darker sides of what people can become in this line of work: arrogant, dismissive, and forgetful of the fact that working with the people around them matters more than the programs they design. It can be a struggle to be a more gentle soul in this line of work when an unstated expectation of your job is that you should always assert yourself to the front of every discussion.

God wants us to be the best “us” we can be. God does not expect us to fit into one particular mold, no matter what our particular gifts are. The psalm I quoted above encourages me that God especially loves those who take the time to be quiet and try to understand: understand life, understand our role in the world, understand love, understand ourselves. God also does not leave us alone in this struggle, but is a partner with us, hand quietly on our shoulder as we go through life.

posted by Peter at 5:32 pm  

Powered by WordPress