Digital Garden
Computer Science
Theoretical CS
Formal Languages

Formal Languages

Alphabet

An alphabet is a finite non-empty set of symbols. Sometimes these symbols are called letters and the alphabet is called a vocabulary. An alphabet is usually denoted by the symbol Σ\Sigma and the symbols in the alphabet are in lower case.

Example
  • Σ={a,b,c}\Sigma = \{a, b, c\} is an alphabet with three symbols.
  • Σbool={0,1}\Sigma_{\text{bool}} = \{0, 1\} is an alphabet with two symbols representing boolean or binary values.
  • Σlat={a,b,c,,z}\Sigma_{\text{lat}} = \{a, b, c, \ldots, z\} is an alphabet with all the lowercase letters of the Latin alphabet.
  • Σ={0,1,2,,9}\Sigma = \{0, 1, 2, \ldots, 9\} is an alphabet with the digits 0 to 9.
  • N\Bbb{N} is not an alphabet because it is not finite.
  • Σ={}\Sigma = \{\} is not an alphabet because it is empty.

Words

A word is a finite sequence of symbols from an alphabet also commonly called a string. Sequence elements are commonly seperated by commas. However in the context of formal languages, we often omit the commas and write the sequence of symbols without any seperators as long as no ambiguity arises.

The length of a word is calculated just like the length of a sequence, by counting the number of elements in the sequence, i.e. the number of symbols in the word.

Example
  • w=a,b,cw = a, b, c is a word with three symbols. w=3|w| = 3.
  • w=abcw = abc is the same word as above but written without commas.
  • w=wordw = word is a word over the alphabet Σlat\Sigma_{\text{lat}}.
  • w=101w = 101 is a word over the alphabet Σbool\Sigma_{\text{bool}}.
  • w=101010w = 101010 is not a word over the alphabet Σlat\Sigma_{\text{lat}}.

Empty Word

The empty word is a word with no symbols. It is often denoted by ε\varepsilon or λ\lambda. Just like the empty set is a subset of every set, the empty word is a word over every alphabet.

Because the empty word has no symbols, its length is zero, i.e. ε=0|\varepsilon| = 0.

Number of Occurrences

The number of occurrences of a symbol in a word is the number of times that symbol appears in the word. This is often denoted by wa|w|_a where aa is the symbol and ww is the word.

Example
  • w=abacabaw = abacaba, wa=4|w|_a = 4.
  • w=101010w = 101010, w0=3|w|_0 = 3.

Using this we can also define the length of a word as the sum of the number of occurrences of each symbol in the word:

w=aΣwa|w| = \sum_{a \in \Sigma} |w|_a

Concatenating Words

Concatenating two words is the operation of joining the two words together to form a new word. This is often denoted by the symbol \cdot. So we define the concatenation of two words uu and vv as follows:

uv=uvu \cdot v = uv

This is just like in normal arithmetic where we can omit the multiplication sign when multiplying variables. When concatonating words the length of the new word is the sum of the lengths of the two words.

uv=u+v|u \cdot v| = |u| + |v|

This then also gives a logical definition of the concatenation of the empty word with any word. The concatenation of the empty word with any word is just the word itself. This is because the empty word has no symbols and no length, so adding it to any word does not change the word or its length.

Info

However unlike normal multiplication, the order of the words matters in concatenation, i.e. the concatenation operation is not commutative.

uvvuu \cdot v \neq v \cdot u

However, the concatenation operation is associative:

(uv)w=u(vw)(u \cdot v) \cdot w = u \cdot (v \cdot w)
Example
  • u=abu = ab, v=cdv = cd, uv=abcdu \cdot v = abcd.
  • u=101u = 101, v=010v = 010, uv=101010u \cdot v = 101010.
  • u=εu = \varepsilon, v=abcv = abc, uv=abcu \cdot v = abc.

Powers of Words

The power of a word is the concatenation of the word with itself a certain number of times. This is denoted by a superscript and also often called iteration or repetition of a word. The power nn of a word has to be a non-negative integer, nN0n \in \Bbb{N}_0. Just like with powers of numbers where 11 is the identity element for the power of 00, the empty word is the identity element for the power of 00 for words.

wn=i=1nww^n = \prod_{i=1}^{n} w

Using the definition of concatenation, we can also write this as:

wn=wwwn timesw^n = \underbrace{w \cdot w \cdot \ldots \cdot w}_{n \text{ times}}

Which also leads to a recursive definition of the power of a word:

wn={εif n=0wwn1if n>0w^n = \begin{cases} \varepsilon & \text{if } n = 0 \\ w \cdot w^{n-1} & \text{if } n > 0 \end{cases}
Example
  • w=101w = 101, w3=101101101w^3 = 101101101.
  • w=abcw = abc, w2=abcabcw^2 = abcabc.
  • w=101w = 101, w0=εw^0 = \varepsilon.
  • w=εw = \varepsilon, w2=εw^2 = \varepsilon.

Using the power of words we can write long sequences of the same symbol in a more compact way, this can however lead to ambigous words:

w=bbababababbaaaabab=b2(ab)4ba4bab=b(ba)4b2a3(ab)2w = bbababababbaaaabab = b^2(ab)^4ba^4bab = b(ba)^4b^2a^3(ab)^2

Reverse of a Word

The reverse of a word is the word with the symbols in the opposite order. This is denoted by the symbol wRw^R.

w=a1a2anwR=anan1a2a1\begin{align*} w & = a_1a_2 \ldots a_n \\ w^R & = a_na_{n-1} \ldots a_2a_1 \end{align*}
Example
  • w=101w = 101, wR=101w^R = 101.
  • w=abcw = abc, wR=cbaw^R = cba.
  • w=101010w = 101010, wR=010101w^R = 010101.
  • w=εw = \varepsilon, wR=εw^R = \varepsilon.

We can also define the reverse of the concatenation of two words as the concatenation of the reverse of the two words in reverse order:

(uv)R=vRuR(u \cdot v)^R = v^R \cdot u^R
Example

This is best shown by example:

u=abv=cduv=abcd(uv)R=dcbavRuR=dcba\begin{align*} u & = ab \\ v & = cd \\ u \cdot v & = abcd \\ (u \cdot v)^R & = dcba \\ v^R \cdot u^R & = dcba \end{align*}

Sets of Words

If we have an alphabet Σ\Sigma, we can define some special sets of words over that alphabet. Specifically we can define the set of all words of a certain length by using the power of the alphabet.

Σn=set of all words of length n over the alphabet Σ\Sigma^n = \text {set of all words of length } n \text{ over the alphabet } \Sigma

This set is finite because the alphabet is finite and the length of the words is fixed.

Example
  • Σ={a,b}\Sigma = \{a, b\}, Σ2={aa,ab,ba,bb}\Sigma^2 = \{aa, ab, ba, bb\}.
  • Σ={0,1}\Sigma = \{0, 1\}, Σ3={000,001,010,011,100,101,110,111}\Sigma^3 = \{000, 001, 010, 011, 100, 101, 110, 111\}.
  • Σ={a,b,c}\Sigma = \{a, b, c\}, Σ0={ε}\Sigma^0 = \{\varepsilon\}.

Kleene Star

The Kleene star is a special set of words that includes all words of any length that can be made from the alphabet. It is the set of all words over the alphabet.

Σ=i0Σi=Σ0Σ1Σ2\Sigma^* = \bigcup_{i \geq 0} \Sigma^i = \Sigma^0 \cup \Sigma^1 \cup \Sigma^2 \cup \ldots

or more formally:

Σ={ε}{x1x2xni,nN+,xiΣ}\Sigma^* = \{\varepsilon\} \cup \{x_1x_2 \ldots x_n \mid i,n \in \Bbb{N}^+, x_i \in \Sigma\}
Example
  • Σ={a,b}\Sigma = \{a, b\}, Σ={ε,a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,bba,bbb,}\Sigma^* = \{\varepsilon, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, \ldots\}.
  • Σ={0,1}\Sigma = \{0, 1\}, Σ={ε,0,1,00,01,10,11,000,001,010,011,100,101,110,111,}\Sigma^* = \{\varepsilon, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, 101, 110, 111, \ldots\}.

Kleene Plus

The Kleene plus is a variation of the Kleene star that does not include the empty word. It is the set of all words of length greater than zero that can be made from the alphabet.

Σ+=Σ{ε}=i1Σi=Σ1Σ2Σ3\Sigma^+ = \Sigma^* - \{\varepsilon\} = \bigcup_{i \geq 1} \Sigma^i = \Sigma^1 \cup \Sigma^2 \cup \Sigma^3 \cup \ldots
Example
  • Σ={a,b}\Sigma = \{a, b\}, Σ+={a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,bba,bbb,}\Sigma^+ = \{a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, \ldots\}.
  • Σ={0,1}\Sigma = \{0, 1\}, Σ+={0,1,00,01,10,11,000,001,010,011,100,101,110,111,}\Sigma^+ = \{0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, 101, 110, 111, \ldots\}.

Subwords

A subword of a word is a word that can be obtained by deleting zero or more symbols from the original word. Subwords are also called substrings or infixes. The word vv is a subword of the word ww if there are words xx and yy in the same alphabet such that:

w=xvyw = xvy

or more formally:

v is a subword of w    x,yΣ:w=xvyv \text{ is a subword of } w \iff \exists x, y \in \Sigma^* : w = xvy

Because xx and yy can be the empty word, every word is a subword of itself. This also leads to the empty word being a subword of every word as well because we can always choose xx to be the empty word and yy to be the word itself, or vice versa, or some split in between.

Visual representation of subwords of a word.

A proper subword is a subword that is not the same as the original word.

w=xvyvwv<wv is a proper subword of w    x,yΣ{ε}:w=xvy\begin{align*} w & = xvy \\ v & \neq w \\ |v| & < |w| \\ v \text{ is a proper subword of } w & \iff \exists x, y \in \Sigma^*- \{\varepsilon\} : w = xvy \end{align*}

This also means that either xx or yy or both are not the empty word and that v<w|v| < |w|.

Example
  • w=abacabaw = abacaba, v=abacabav = abacaba is a subword of ww.
  • w=abacabaw = abacaba, v=bacv = bac is a proper subword of ww.
  • w=101010w = 101010, v=101v = 101 is a proper subword of ww.
  • w=101010w = 101010, v=εv = \varepsilon is a proper subword of ww.
  • w=101010w = 101010, v=abcv = abc is not a subword of ww.

Prefix

A prefix of a word is a subword that starts at the beginning of the word. This means that the prefix vv of a word ww is a subword of ww such that there is a word yy in the same alphabet such that:

w=vyw = vy

Because the prefix is a special case of a subword it follows all the same rules as a subword and also has the differenation between normal and proper prefixes.

Visual representation of prefixes of a word.
Example
  • w=abacabaw = abacaba, v=av = a is a proper prefix of ww.
  • w=abacabaw = abacaba, v=abacabav = abacaba is a prefix of ww.
  • w=101010w = 101010, v=101v = 101 is a proper prefix of ww.
  • w=101010w = 101010, v=εv = \varepsilon is a proper prefix of ww.
  • w=101010w = 101010, v=010v = 010 is not a prefix of ww.

Suffix

A suffix of a word is a subword that is at the end of the word. This means that the suffix vv of a word ww is a subword of ww such that there is a word xx in the same alphabet such that:

w=xvw = xv

or more formally:

v is a suffix of w    xΣ:w=xvv \text{ is a suffix of } w \iff \exists x \in \Sigma^* : w = xv

Because the suffix is a special case of a subword it follows all the same rules as a subword and also has the differenation between normal and proper suffixes.

Visual representation of suffixes of a word.
Example
  • w=abacabaw = abacaba, v=av = a is a suffix of ww.
  • w=abacabaw = abacaba, v=abacabav = abacaba is a suffix of ww.
  • w=101010w = 101010, v=010v = 010 is a proper suffix of ww.
  • w=101010w = 101010, v=εv = \varepsilon is a proper suffix of ww.
  • w=101010w = 101010, v=101v = 101 is not a suffix of ww.

Languages

A language is a specific set of words over an alphabet commonly denoted by the symbol LL. More precisly a language is a subset of the set of all words over an alphabet.

LΣL \subseteq \Sigma^*
Example

We can define the english language as a language over the alphabet including all the letters of the English alphabet and some punctuation marks. The language LengL_\text{eng} would then include all the words that are valid in the English language.

Leng={hello,world,cat,dog,}L_\text{eng} = \{\text{hello}, \text{world}, \text{cat}, \text{dog}, \ldots\}

Programming languages (such as C) are languages based on the ASCII character set, where the programs must have a correct syntax.

LC={int,main,printf,scanf,}L_\text{C} = \{\text{int}, \text{main}, \text{printf}, \text{scanf}, \ldots\}

The language of all binary numbers with the same number of 0s and 1s is a language over the alphabet {0,1}\{0, 1\}.

L={00,11,0011,0101,0110,1001,1010,1100,}L = \{00, 11, 0011, 0101, 0110, 1001, 1010, 1100, \ldots\}

Sub and Supersets of Languages

If LL is a language over an alphabet Σ1\Sigma_1 and Σ2\Sigma_2 is a superset of Σ1\Sigma_1, then LL is obviously also a language over Σ2\Sigma_2:

LΣ1Σ2L \subseteq \Sigma_1^* \subseteq \Sigma_2^*

The empty set is a language over every alphabet because it is a subset of every set of words. It is often denoted by the symbol LL_\emptyset or just \emptyset.

LΣ for all ΣL_\emptyset \subseteq \Sigma^* \text{ for all } \Sigma

The language that contains only the empty word is also a language over every alphabet. It is often denoted by the symbol LεL_\varepsilon or just {ε}\{\varepsilon\}.

LεΣ for all ΣL_{\varepsilon} \subseteq \Sigma^* \text{ for all } \Sigma

However, this does not mean that the empty set and the set containing only the empty word are the same.

LLεL_\emptyset \neq L_{\varepsilon}

Operations on Languages

Because languages are just sets we can apply the common set operations to them. This includes union, intersection, difference, and complement. These are all self explanatory and work the way as expected. However, when the operation is binary, i.e. it takes two languages as input, the two languages must be over the same alphabet.

Info
  • Union: L1L2={wwL1 or wL2}L_1 \cup L_2 = \{w \mid w \in L_1 \text{ or } w \in L_2\}
  • Intersection: L1L2={wwL1 and wL2}L_1 \cap L_2 = \{w \mid w \in L_1 \text{ and } w \in L_2\}
  • Difference: L1L2={wwL1 and wL2}L_1 - L_2 = \{w \mid w \in L_1 \text{ and } w \notin L_2\}
  • Complement: LC={wwL}L^C = \{w \mid w \notin L\} or L1C=ΣL1L_1^C = \Sigma^* - L_1 i.e the set of all words over the alphabet that are not in the language.

In addition to these set operations, we can also apply the operations of concatenation and the Kleene star to languages. These operations are defined in the same way as for words.

Concatenation

When concatenating two languages, we concatenate every word in the first language with every word in the second language.

L1L2={w1w2w1L1,w2L2}L_1 \cdot L_2 = \{w_1 \cdot w_2 \mid w_1 \in L_1, w_2 \in L_2\}

This is also called the product of two languages. The concatenation of two languages is very similar to the cartesian product of two sets. This is clearly visible in the example below.

Example
  • L1={a,b}L_1 = \{a, b\}, L2={c,d}L_2 = \{c, d\}, L1L2={ac,ad,bc,bd}L_1 \cdot L_2 = \{ac, ad, bc, bd\}.
  • L1={101,010}L_1 = \{101, 010\}, L2={11,00}L_2 = \{11, 00\}, L1L2={10111,10100,01011,01000}L_1 \cdot L_2 = \{10111, 10100, 01011, 01000\}.

Kleene Star

Using the concatenation operation we can also define the Kleene star of a language. The Kleene star of a language is the set of all words that can be made by concatenating any number of words from the original languages. For this we define the following:

L0={ε}L1=LL2=LLLn+1=LLn\begin{align*} L^0 & = \{\varepsilon\} \\ L^1 & = L \\ L^2 & = L \cdot L \\ L^{n+1} & = L \cdot L^n \end{align*}

We can then define the Kleene star of a language as:

L=i0Li=L0L1L2L^* = \bigcup_{i \geq 0} L^i = L^0 \cup L^1 \cup L^2 \cup \ldots

We can also define the Kleene plus, i.e the set of all words that can be made by concatenating any number of words from the original language except the empty word:

L+=L{ε}=i1Li=L1L2L3L^+ = L^* - \{\varepsilon\} = \bigcup_{i \geq 1} L^i = L^1 \cup L^2 \cup L^3 \cup \ldots
Example
  • L={a,b}L = \{a, b\}, L={ε,a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,bba,bbb,}L^* = \{\varepsilon, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, \ldots\}.
  • L={101,010}L = \{101, 010\}, L={ε,101,010,101101,101010,010101,010010,101101101,101101010,}L^* = \{\varepsilon, 101, 010, 101101, 101010, 010101, 010010, 101101101, 101101010, \ldots\}.