Regular Languages

A regular language is a language that can be expressed with a regular expression or a deterministic or non-deterministic finite automata or state machine. A language is a set of strings which are made up of characters from a specified alphabet, or set of symbols. Regular languages are a subset of the set of all strings. Regular languages are used in parsing and designing programming languages and are one of the first concepts taught in computability courses. These are useful for helping computer scientists to recognize patterns in data and group certain computational problems together — once they do that, they can take similar approaches to solve the problems grouped together. Regular languages are a key topic in computability theory.

Regular Languages

Regular languages and finite automata can model computational problems that require a very small amount of memory. For example, a finite automaton can generate a regular language to describe if a light switch is on or off, but it cannot keep track of how many times the light was switched on or off.

The machine above can know what state it is currently in, on or off, but it doesn't know what the history of flips was to get it to where it is (i.e. it doesn't know how many times the switch has been flipped on or off).

Finite automata generate or model regular languages. This means that regular languages can be described by a simple state machine diagram. A finite state machine, \(M,\) describes a given language, \(L\). \(M\) is said to accept a string \(w\) if the machine starts in a start state, undergoes some series of state transitions, and ends up in an accepting state. We say that the machine \(M\) recognizes the language \(L\) if \(M\) accepts all strings \(w\) that are in \(L\). A language is a regular language if there is a finite automaton that recognizes it.

For example, this machine recognizes the language of strings that have an even number of zeroes since any string that has an even number of zeroes will go from the start state to an accepting state.

Operations on Regular Languages

A regular language can be represented by a string of symbols and operations.

Concatenation

Concatenation is an operation that combines two symbols, sets, or languages. There are two main ways to write a concatenation. \( X \circ Y\) and \(XY\) both mean “X concatenated with Y.” Languages can also be concatenated: \(L_1 \circ L_2\) means strings from \(L_1\) are written first, and then strings from \(L_2\) come after.

Formally, concatenation is defined as follows: \(A \circ B = \{xy | x \in A \text{ and } y \in B\}\).

Union

Union is an operation where a finite state machine can make one choice or another at any step (this can also be thought of as an “or” operation). For example, there could be a finite state machine that can choose between concatenating \(X\) with \(Y\) or concatenating \(X\) with \(Z\). Or is written with a vertical bar, and the expression from the previous sentence looks like this: \((X \circ Y | X \circ Z)\).

Formally, union is described as follows: \( A \cup B = \{x | x \in A \text{ or } x \in B\}\). Note, here the vertical bar means “such that.”

Kleene Star

The Kleene star is an operation denoted by an asterisk \(*\) that basically represents repeated self-concatenation. For example, if a language \(L\) represents all strings of the form \(X^*,\) then this language includes \(X\), \(XX\), \(XXXXXX\), \(XXXXXXXXXXX\), etc. The Kleene star can repeat the symbol it operates on any number of times.

The regular language \(X \circ Y | Z^*\) represents strings of the form “X concatenated with Y or Z repeated any number of times.”

Formally, the Kleene star is defined as follows: \(A^* = \{x_1x_2...x_k | k \geq 0 \text{ and each } x_i \in A\}\). The Kleene star is a unary operation since it operates only on one thing instead of two (unlike union and concatenation, which are binary operations).

Because \(k\) can be \(0\), \(x\) can be repeated \(0\) times. This means that the empty string is always a member of \(A^*\).

The Empty String

Another important symbol is \(\epsilon\) which represents the empty string.

Language Notation

Languages are often written as follows: \(L = \text{(some pattern)} \big| \text{(some condition)}, \) where the vertical bar means "such that." The pattern often resembles something like this \(0^n1^n,\) where the exponent actually represents a unary operation — the \(0\) would be repeated \(n\) times. For example, \(0^41^4\) is written out as \(00001111\).

Translating between Regular Languages, Finite Automata, and Regular Expressions

Finite automata and regular expressions are different ways to represent regular languages.

Finite automata can be used to generate strings in a regular language. A finite automaton for a particular language is “programmed,” in a way, to generate the strings of a given language through its states and transition functions. You can walk through a finite state machine to see what strings are able to be made and therefore are part of the language the machine described, or you can feed it an input string to see if a given input can be made by the machine.

Note: The symbol epsilon, \(\epsilon\), represents transitions on the empty string, sometimes called a "null transition." This means that the machine can take these transitions without needing to read a particular symbol on the input.

A regular expression is one way to represent a regular language as a string. For example, the regular language described by the regular expression \(0^* 1 \big| 1^*0\) means strings that either contain any number of 0’s followed by a single 1 or any number of 1’s followed by a single 0. This regular expression can be represented by the following finite state machine:

The regular expression \((01)^* 11 (10)^*\), which is the language of strings starting with any number of “01” substrings, followed by two 1’s and then any number of “10” substrings, can be represented with the following finite state machine:

What is the regular expression that describes this finite state machine?

^[1]

The start state is \(q_0\) and the accepting state is \(q_1\).

From \(q_0\) we can generate a 1 by looping back in on \(q_0\), and we can do this however many times; we can generate a 0 by looping back in on \(q_0\), and we can do this however many times; or we can generate a 0 by transitioning to \(q_1\). Once we are in \(q_1\), we can generate any number of 0's by looping back in on \(q_1\).

Therefore, the regular expression that describes this is \((0|1)^*00^*\).

Formal Definition and Proving that a Language is Regular

The set of regular languages over a given alphabet, \(\Sigma\) is defined recursively as follows^[2]:

The empty language and empty string are regular languages.

For every \(x \in \Sigma\) the singleton language \(\{a\}\) is a regular language.

The closure rules described in the next section apply to regular languages.

No other languages over \(\Sigma\) are regular.

To prove if a language is a regular language, one can simply provide the finite state machine that generates it. If the finite state machine for a given language is not obvious (and this might certainly be the case if a language is, in fact, non-regular), the pumping lemma for regular languages is a useful tool. To see nonregular languages, check out the kinds of languages that are recognized by context-free grammars or Turing machines — both of these recognize regular languages and more.

The Pumping Lemma For Regular Languages:

The idea behind the pumping lemma for regular languages is that there are certain constraints a language must adhere to in order to be a regular language. You can use the pumping lemma to test if all of these constraints hold for a particular language, and if they do not, you can prove with contradiction that the language is not regular.

The pumping lemma for regular languages states that if a language \(L\) is regular, there exists some integer pumping length \(p \geq 1\) such that every string \(s \in L\) with a length of \(p\) or more symbols, \(|s| \geq p\), that can be written \(s = xyz,\) where \(x, y, \text{ and } z\) are substrings of \(s\) such that

\(|y| \geq 1\)

\(|xy| \leq p\)

\(\text{for all } i \geq 0,\, xy^iz \in L. \)

All regular languages are “pumpable,” meaning that the pumping lemma for regular languages constraints hold true for all regular languages. If a language is not pumpable, then it is not a regular language. However, if a language is pumpable, it is not necessarily a regular language.

Essentially, the pumping lemma holds that arbitrarily long strings \(s \in L\) can be pumped, or have segments repeated, without ever producing a new string that is not in the language \(L\).

To prove that a language is not regular, use proof by contradiction and the pumping lemma. Set up a proof that claims that \(L\) is regular, and show that a contradiction of the pumping lemma’s constraints occurs in at least one of the three constraints listed above.

Use the pumping lemma to prove that \(L = \{a^nb^n | n > 0\}\) is not a regular language.

Assume, for the sake of contradiction, that \(L = \{a^nb^n| n > 0\}\) is a regular language. By the pumping lemma, there exists an integer pumping length \(p\) for \(L\). We need a string \(s\) that is longer than or equal to the length of \(p\). Certainly \(s = a^pb^p\) is longer than \(p\), so we choose this for the \(s\) string. This \(s\) is in \(L\) since it has \(p\) \(a\)’s and \(p\) \(b\)’s.

Now by the pumping lemma, \(|y| \leq p\). There are three possible places in the string that we can assign to be \(y\):

\(y = a^j\) for some \(j \leq p\). This means that \(y\) is contained purely in the \(a\)’s section.

\(y = b^j\) for some \(j \leq p\). This means that \(y\) is contained purely in the \(b\)’s section.

\(y = a^jb^k\) for some \(j\) and \(k,\) where \(j+k \leq p.\) This means that the \(y\) segment is contained somewhere in the \(a\)’s and \(b\)’s section.

In any of these three cases, we can easily verify that the third constraint for the pumping lemma, \(xy^iz \in L\ \forall i \geq 0 \), does not hold. In other words, for any of these three choices of \(y\), the string \(s\) cannot be pumped in a way that results in a string that has an equal number of \(a\)’s and \(b\)’s \((\)the definition of the language \(L).\)

Let’s take a short example string described by \(a^5b^5 = aaaaabbbbb\) and \(p = 3\).

In the first case, there will be more \(a\)’s than there are \(b\)’s, making the resulting, pumped string not a member of \(L\). If we pump this region, we will get the string \(aaaaaaaabbbbb:\) a string with \(8\) \(a\)’s and \(5\) \(b\)’s. Clearly, this is not in the language. A similar proof can be checked for the second case; just pump the \(b\) region instead of the \(a\) region.

For the third case, we do something similar. We just need to find one example of pumping in this region that violates the third property of the pumping lemma since if the language is regular, it needs to be pumpable for all \(i\). If we pump \(a^5b^5 = aaaaabbbbb\) and \(p = 3\) and pump the last \(a\) in the \(a\) section and the first two \(b\)’s in the \(b\) section, we get this string: \(aaaaaabbbbbbb\) — a string with six \(a\)’s and seven \(b\)’s, which is clearly not in the language of strings with an equal number of \(a\)’s and \(b\)’s.

Closure Properties

Regular languages have the following closure properties. A set is closed under an operation if doing the operation on a given set always produces a member of the same set. This means that if one of these closed operations is applied to a regular language, the result will also be a regular language.

Union: Regular languages are closed under the union operation. This means that if \(L_1\) and \(L_2\) are both regular languages, then \(L_1 \cup L_2\) is also a regular language.

Since \(L_1\) and \(L_2\) are both regular languages, by definition there is some finite state machine that describes each of them; call these \(M_1\) and \(M_2\).

If we make a machine \(M\) from \(M_1\) and \(M_2\), the machine will accept a given input string \(w\) if either \(M_1\) or \(M_2\) accepts \(w\). Since \(M\) simulates \(M_1\) and \(M_2,\) \(M\) will accept if either \(M_1\) or \(M_2\) accepts, which means it will accept the union of \(L_1\) and \(L_2\).

Here is what such a machine \(M\) would look like:

^[3]

Intersection: Regular languages are closed under the intersection operation. This means that if \(L_1\) and \(L_2\) are both regular languages, then \(L_1 \cap L_2\) will also be a regular language.

Just like in the union closure proof, construct a machine \(M\) that is made up of \(M_1\) and \(M_2\) which accept \(L_1\) and \(L_2\), respectively. \(M\) runs both the \(M_1\) and \(M_2\) machines in parallel and accepts if both \(M_1\) and \(M_2\) accepts.

^[4]

Concatenation: Regular languages are closed under the concatenation operation. This means that if \(L_1\) and \(L_2\) are both regular languages, then \(L_1 \circ L_2\) will also be a regular language.

This proof is similar to the proof for closure under union. First, take \(M_1\) which accepts \(L_1\) and \(M_2\) which accepts \(L_2\). How can we hook up these two machines into a machine \(M\) that accepts \(L_1 \circ L_2?\)

In union, we connected the finite state machines in series, and with an intersection, we connected the machines in parallel. Since the concatenation of \(L_1\) and \(L_2\) means we put strings from \(L_1\) before strings from \(L_2\), we have to combine the machines in such a way that all/any of the \(L_1\) strings are produced before we get to the machine that will produce strings from \(L_2\). We can accomplish this by hooking up \(M_1\) and \(M_2\) like this using a nondeterministic finite automaton:

^[4]

Kleene star: It is easy to see that regular languages are closed under the star operation since concatenation is closed and the star is just repeated self-concatenation.

^[3]

Complement: Regular languages are closed under complement. This means that if \(L\) is a regular language, then \(\overline{L}\) is also a regular language.

Take the state diagram for \(L\) and change all of the accepting states to rejecting states and all of the accepting states to rejecting states. We have just made a new finite state machine that accepts \(L\)’s complement. Since a finite state machine accepts \(\overline{L}\), it is a regular language.

^[3]

Relationship with other Computation Models

Regular languages are less powerful than context-free languages which are generated by context-free grammars and pushdown automata, and recursively enumerable languages which are generated by Turing machines.

^[5]

References

, A. Buchi non deterministic example.svg. Retrieved June 19, 2016, from https://en.wikipedia.org/wiki/File:Buchi_non_deterministic_example.svg
, . Regular Language. Retrieved June 21, 2016, from https://en.wikipedia.org/wiki/Regular_language
Aaronson, S. 6.045J Automata, Computability, and Complexity: Lecture 4, Spring 2011. Retrieved June,21, 2016, from http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-045j-automata-computability-and-complexity-spring-2011
Aaronson, S. 6.045J Automata, Computability, and Complexity: Lecture 4, Spring 2011. Retrieved June, 21, 2016, from http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-045j-automata-computability-and-complexity-spring-2011
Fitch, W., & Friederici , A. Artificial grammar learning meets formal language theory: an overview. Retrieved June 19, 2016, from http://rstb.royalsocietypublishing.org/content/367/1598/1933

Contents