Formal Languages

Alphabet

An alphabet is a finite non-empty set of symbols. Sometimes these symbols are called letters and the alphabet is called a vocabulary. An alphabet is usually denoted by the symbol $\Sigma$ and the symbols in the alphabet are in lower case.

Example

$\Sigma = \{a, b, c\}$ is an alphabet with three symbols.
$\Sigma_{\text{bool}} = \{0, 1\}$ is an alphabet with two symbols representing boolean or binary values.
$\Sigma_{\text{lat}} = \{a, b, c, \ldots, z\}$ is an alphabet with all the lowercase letters of the Latin alphabet.
$\Sigma = \{0, 1, 2, \ldots, 9\}$ is an alphabet with the digits 0 to 9.
$\Bbb{N}$ is not an alphabet because it is not finite.
$\Sigma = \{\}$ is not an alphabet because it is empty.

Words

A word is a finite sequence of symbols from an alphabet also commonly called a string. Sequence elements are commonly seperated by commas. However in the context of formal languages, we often omit the commas and write the sequence of symbols without any seperators as long as no ambiguity arises.

The length of a word is calculated just like the length of a sequence, by counting the number of elements in the sequence, i.e. the number of symbols in the word.

Example

$w = a, b, c$ is a word with three symbols. $|w| = 3$ .
$w = abc$ is the same word as above but written without commas.
$w = word$ is a word over the alphabet $\Sigma_{\text{lat}}$ .
$w = 101$ is a word over the alphabet $\Sigma_{\text{bool}}$ .
$w = 101010$ is not a word over the alphabet $\Sigma_{\text{lat}}$ .

Empty Word

The empty word is a word with no symbols. It is often denoted by $\varepsilon$ or $\lambda$ . Just like the empty set is a subset of every set, the empty word is a word over every alphabet.

Because the empty word has no symbols, its length is zero, i.e. $|\varepsilon| = 0$ .

Number of Occurrences

The number of occurrences of a symbol in a word is the number of times that symbol appears in the word. This is often denoted by $|w|_a$ where $a$ is the symbol and $w$ is the word.

Example

$w = abacaba$ , $|w|_a = 4$ .
$w = 101010$ , $|w|_0 = 3$ .

Using this we can also define the length of a word as the sum of the number of occurrences of each symbol in the word:

|w| = \sum_{a \in \Sigma} |w|_a

Concatenating Words

Concatenating two words is the operation of joining the two words together to form a new word. This is often denoted by the symbol $\cdot$ . So we define the concatenation of two words $u$ and $v$ as follows:

u \cdot v = uv

This is just like in normal arithmetic where we can omit the multiplication sign when multiplying variables. When concatonating words the length of the new word is the sum of the lengths of the two words.

|u \cdot v| = |u| + |v|

This then also gives a logical definition of the concatenation of the empty word with any word. The concatenation of the empty word with any word is just the word itself. This is because the empty word has no symbols and no length, so adding it to any word does not change the word or its length.

Info

However unlike normal multiplication, the order of the words matters in concatenation, i.e. the concatenation operation is not commutative.

u \cdot v \neq v \cdot u

However, the concatenation operation is associative:

(u \cdot v) \cdot w = u \cdot (v \cdot w)

Example

$u = ab$ , $v = cd$ , $u \cdot v = abcd$ .
$u = 101$ , $v = 010$ , $u \cdot v = 101010$ .
$u = \varepsilon$ , $v = abc$ , $u \cdot v = abc$ .

Powers of Words

The power of a word is the concatenation of the word with itself a certain number of times. This is denoted by a superscript and also often called iteration or repetition of a word. The power $n$ of a word has to be a non-negative integer, $n \in \Bbb{N}_0$ . Just like with powers of numbers where $1$ is the identity element for the power of $0$ , the empty word is the identity element for the power of $0$ for words.

w^n = \prod_{i=1}^{n} w

Using the definition of concatenation, we can also write this as:

w^n = \underbrace{w \cdot w \cdot \ldots \cdot w}_{n \text{ times}}

Which also leads to a recursive definition of the power of a word:

w^n = \begin{cases} \varepsilon & \text{if } n = 0 \\ w \cdot w^{n-1} & \text{if } n > 0 \end{cases}

Example

$w = 101$ , $w^3 = 101101101$ .
$w = abc$ , $w^2 = abcabc$ .
$w = 101$ , $w^0 = \varepsilon$ .
$w = \varepsilon$ , $w^2 = \varepsilon$ .

Using the power of words we can write long sequences of the same symbol in a more compact way, this can however lead to ambigous words:

w = bbababababbaaaabab = b^2(ab)^4ba^4bab = b(ba)^4b^2a^3(ab)^2

Reverse of a Word

The reverse of a word is the word with the symbols in the opposite order. This is denoted by the symbol $w^R$ .

\begin{align*} w & = a_1a_2 \ldots a_n \\ w^R & = a_na_{n-1} \ldots a_2a_1 \end{align*}

Example

$w = 101$ , $w^R = 101$ .
$w = abc$ , $w^R = cba$ .
$w = 101010$ , $w^R = 010101$ .
$w = \varepsilon$ , $w^R = \varepsilon$ .

We can also define the reverse of the concatenation of two words as the concatenation of the reverse of the two words in reverse order:

(u \cdot v)^R = v^R \cdot u^R

Example

This is best shown by example:

\begin{align*} u & = ab \\ v & = cd \\ u \cdot v & = abcd \\ (u \cdot v)^R & = dcba \\ v^R \cdot u^R & = dcba \end{align*}

Sets of Words

If we have an alphabet $\Sigma$ , we can define some special sets of words over that alphabet. Specifically we can define the set of all words of a certain length by using the power of the alphabet.

\Sigma^n = \text {set of all words of length } n \text{ over the alphabet } \Sigma

This set is finite because the alphabet is finite and the length of the words is fixed.

Example

$\Sigma = \{a, b\}$ , $\Sigma^2 = \{aa, ab, ba, bb\}$ .
$\Sigma = \{0, 1\}$ , $\Sigma^3 = \{000, 001, 010, 011, 100, 101, 110, 111\}$ .
$\Sigma = \{a, b, c\}$ , $\Sigma^0 = \{\varepsilon\}$ .

Kleene Star

The Kleene star is a special set of words that includes all words of any length that can be made from the alphabet. It is the set of all words over the alphabet.

\Sigma^* = \bigcup_{i \geq 0} \Sigma^i = \Sigma^0 \cup \Sigma^1 \cup \Sigma^2 \cup \ldots

or more formally:

\Sigma^* = \{\varepsilon\} \cup \{x_1x_2 \ldots x_n \mid i,n \in \Bbb{N}^+, x_i \in \Sigma\}

Example

$\Sigma = \{a, b\}$ , $\Sigma^* = \{\varepsilon, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, \ldots\}$ .
$\Sigma = \{0, 1\}$ , $\Sigma^* = \{\varepsilon, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, 101, 110, 111, \ldots\}$ .

Kleene Plus

The Kleene plus is a variation of the Kleene star that does not include the empty word. It is the set of all words of length greater than zero that can be made from the alphabet.

\Sigma^+ = \Sigma^* - \{\varepsilon\} = \bigcup_{i \geq 1} \Sigma^i = \Sigma^1 \cup \Sigma^2 \cup \Sigma^3 \cup \ldots

Example

$\Sigma = \{a, b\}$ , $\Sigma^+ = \{a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, \ldots\}$ .
$\Sigma = \{0, 1\}$ , $\Sigma^+ = \{0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, 101, 110, 111, \ldots\}$ .

Subwords

A subword of a word is a word that can be obtained by deleting zero or more symbols from the original word. Subwords are also called substrings or infixes. The word $v$ is a subword of the word $w$ if there are words $x$ and $y$ in the same alphabet such that:

w = xvy

or more formally:

v \text{ is a subword of } w \iff \exists x, y \in \Sigma^* : w = xvy

Because $x$ and $y$ can be the empty word, every word is a subword of itself. This also leads to the empty word being a subword of every word as well because we can always choose $x$ to be the empty word and $y$ to be the word itself, or vice versa, or some split in between.

Visual representation of subwords of a word.

A proper subword is a subword that is not the same as the original word.

\begin{align*} w & = xvy \\ v & \neq w \\ |v| & < |w| \\ v \text{ is a proper subword of } w & \iff \exists x, y \in \Sigma^*- \{\varepsilon\} : w = xvy \end{align*}

This also means that either $x$ or $y$ or both are not the empty word and that $|v| < |w|$ .

Example

$w = abacaba$ , $v = abacaba$ is a subword of $w$ .
$w = abacaba$ , $v = bac$ is a proper subword of $w$ .
$w = 101010$ , $v = 101$ is a proper subword of $w$ .
$w = 101010$ , $v = \varepsilon$ is a proper subword of $w$ .
$w = 101010$ , $v = abc$ is not a subword of $w$ .

Prefix

A prefix of a word is a subword that starts at the beginning of the word. This means that the prefix $v$ of a word $w$ is a subword of $w$ such that there is a word $y$ in the same alphabet such that:

w = vy

Because the prefix is a special case of a subword it follows all the same rules as a subword and also has the differenation between normal and proper prefixes.

Visual representation of prefixes of a word.

Example

$w = abacaba$ , $v = a$ is a proper prefix of $w$ .
$w = abacaba$ , $v = abacaba$ is a prefix of $w$ .
$w = 101010$ , $v = 101$ is a proper prefix of $w$ .
$w = 101010$ , $v = \varepsilon$ is a proper prefix of $w$ .
$w = 101010$ , $v = 010$ is not a prefix of $w$ .

Suffix

A suffix of a word is a subword that is at the end of the word. This means that the suffix $v$ of a word $w$ is a subword of $w$ such that there is a word $x$ in the same alphabet such that:

w = xv

or more formally:

v \text{ is a suffix of } w \iff \exists x \in \Sigma^* : w = xv

Because the suffix is a special case of a subword it follows all the same rules as a subword and also has the differenation between normal and proper suffixes.

Visual representation of suffixes of a word.

Example

$w = abacaba$ , $v = a$ is a suffix of $w$ .
$w = abacaba$ , $v = abacaba$ is a suffix of $w$ .
$w = 101010$ , $v = 010$ is a proper suffix of $w$ .
$w = 101010$ , $v = \varepsilon$ is a proper suffix of $w$ .
$w = 101010$ , $v = 101$ is not a suffix of $w$ .

Languages

A language is a specific set of words over an alphabet commonly denoted by the symbol $L$ . More precisly a language is a subset of the set of all words over an alphabet.

L \subseteq \Sigma^*

Example

We can define the english language as a language over the alphabet including all the letters of the English alphabet and some punctuation marks. The language $L_\text{eng}$ would then include all the words that are valid in the English language.

L_\text{eng} = \{\text{hello}, \text{world}, \text{cat}, \text{dog}, \ldots\}

Programming languages (such as C) are languages based on the ASCII character set, where the programs must have a correct syntax.

L_\text{C} = \{\text{int}, \text{main}, \text{printf}, \text{scanf}, \ldots\}

The language of all binary numbers with the same number of 0s and 1s is a language over the alphabet $\{0, 1\}$ .

L = \{00, 11, 0011, 0101, 0110, 1001, 1010, 1100, \ldots\}

Sub and Supersets of Languages

If $L$ is a language over an alphabet $\Sigma_1$ and $\Sigma_2$ is a superset of $\Sigma_1$ , then $L$ is obviously also a language over $\Sigma_2$ :

L \subseteq \Sigma_1^* \subseteq \Sigma_2^*

The empty set is a language over every alphabet because it is a subset of every set of words. It is often denoted by the symbol $L_\emptyset$ or just $\emptyset$ .

L_\emptyset \subseteq \Sigma^* \text{ for all } \Sigma

The language that contains only the empty word is also a language over every alphabet. It is often denoted by the symbol $L_\varepsilon$ or just $\{\varepsilon\}$ .

L_{\varepsilon} \subseteq \Sigma^* \text{ for all } \Sigma

However, this does not mean that the empty set and the set containing only the empty word are the same.

L_\emptyset \neq L_{\varepsilon}

Operations on Languages

Because languages are just sets we can apply the common set operations to them. This includes union, intersection, difference, and complement. These are all self explanatory and work the way as expected. However, when the operation is binary, i.e. it takes two languages as input, the two languages must be over the same alphabet.

Info

Union: $L_1 \cup L_2 = \{w \mid w \in L_1 \text{ or } w \in L_2\}$
Intersection: $L_1 \cap L_2 = \{w \mid w \in L_1 \text{ and } w \in L_2\}$
Difference: $L_1 - L_2 = \{w \mid w \in L_1 \text{ and } w \notin L_2\}$
Complement: $L^C = \{w \mid w \notin L\}$ or $L_1^C = \Sigma^* - L_1$ i.e the set of all words over the alphabet that are not in the language.

In addition to these set operations, we can also apply the operations of concatenation and the Kleene star to languages. These operations are defined in the same way as for words.

Concatenation

When concatenating two languages, we concatenate every word in the first language with every word in the second language.

L_1 \cdot L_2 = \{w_1 \cdot w_2 \mid w_1 \in L_1, w_2 \in L_2\}

This is also called the product of two languages. The concatenation of two languages is very similar to the cartesian product of two sets. This is clearly visible in the example below.

Example

$L_1 = \{a, b\}$ , $L_2 = \{c, d\}$ , $L_1 \cdot L_2 = \{ac, ad, bc, bd\}$ .
$L_1 = \{101, 010\}$ , $L_2 = \{11, 00\}$ , $L_1 \cdot L_2 = \{10111, 10100, 01011, 01000\}$ .

Kleene Star

Using the concatenation operation we can also define the Kleene star of a language. The Kleene star of a language is the set of all words that can be made by concatenating any number of words from the original languages. For this we define the following:

\begin{align*} L^0 & = \{\varepsilon\} \\ L^1 & = L \\ L^2 & = L \cdot L \\ L^{n+1} & = L \cdot L^n \end{align*}

We can then define the Kleene star of a language as:

L^* = \bigcup_{i \geq 0} L^i = L^0 \cup L^1 \cup L^2 \cup \ldots

We can also define the Kleene plus, i.e the set of all words that can be made by concatenating any number of words from the original language except the empty word:

L^+ = L^* - \{\varepsilon\} = \bigcup_{i \geq 1} L^i = L^1 \cup L^2 \cup L^3 \cup \ldots

Example

$L = \{a, b\}$ , $L^* = \{\varepsilon, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, \ldots\}$ .
$L = \{101, 010\}$ , $L^* = \{\varepsilon, 101, 010, 101101, 101010, 010101, 010010, 101101101, 101101010, \ldots\}$ .

Template Method Computational Problems