Formal Languages
Alphabet
An alphabet is a finite non-empty set of symbols. Sometimes these symbols are called letters and the alphabet is called a vocabulary. An alphabet is usually denoted by the symbol and the symbols in the alphabet are in lower case.
- is an alphabet with three symbols.
- is an alphabet with two symbols representing boolean or binary values.
- is an alphabet with all the lowercase letters of the Latin alphabet.
- is an alphabet with the digits 0 to 9.
- is not an alphabet because it is not finite.
- is not an alphabet because it is empty.
Words
A word is a finite sequence of symbols from an alphabet also commonly called a string. Sequence elements are commonly seperated by commas. However in the context of formal languages, we often omit the commas and write the sequence of symbols without any seperators as long as no ambiguity arises.
The length of a word is calculated just like the length of a sequence, by counting the number of elements in the sequence, i.e. the number of symbols in the word.
- is a word with three symbols. .
- is the same word as above but written without commas.
- is a word over the alphabet .
- is a word over the alphabet .
- is not a word over the alphabet .
Empty Word
The empty word is a word with no symbols. It is often denoted by or . Just like the empty set is a subset of every set, the empty word is a word over every alphabet.
Because the empty word has no symbols, its length is zero, i.e. .
Number of Occurrences
The number of occurrences of a symbol in a word is the number of times that symbol appears in the word. This is often denoted by where is the symbol and is the word.
- , .
- , .
Using this we can also define the length of a word as the sum of the number of occurrences of each symbol in the word:
Concatenating Words
Concatenating two words is the operation of joining the two words together to form a new word. This is often denoted by the symbol . So we define the concatenation of two words and as follows:
This is just like in normal arithmetic where we can omit the multiplication sign when multiplying variables. When concatonating words the length of the new word is the sum of the lengths of the two words.
This then also gives a logical definition of the concatenation of the empty word with any word. The concatenation of the empty word with any word is just the word itself. This is because the empty word has no symbols and no length, so adding it to any word does not change the word or its length.
However unlike normal multiplication, the order of the words matters in concatenation, i.e. the concatenation operation is not commutative.
However, the concatenation operation is associative:
- , , .
- , , .
- , , .
Powers of Words
The power of a word is the concatenation of the word with itself a certain number of times. This is denoted by a superscript and also often called iteration or repetition of a word. The power of a word has to be a non-negative integer, . Just like with powers of numbers where is the identity element for the power of , the empty word is the identity element for the power of for words.
Using the definition of concatenation, we can also write this as:
Which also leads to a recursive definition of the power of a word:
- , .
- , .
- , .
- , .
Using the power of words we can write long sequences of the same symbol in a more compact way, this can however lead to ambigous words:
Reverse of a Word
The reverse of a word is the word with the symbols in the opposite order. This is denoted by the symbol .
- , .
- , .
- , .
- , .
We can also define the reverse of the concatenation of two words as the concatenation of the reverse of the two words in reverse order:
This is best shown by example:
Sets of Words
If we have an alphabet , we can define some special sets of words over that alphabet. Specifically we can define the set of all words of a certain length by using the power of the alphabet.
This set is finite because the alphabet is finite and the length of the words is fixed.
- , .
- , .
- , .
Kleene Star
The Kleene star is a special set of words that includes all words of any length that can be made from the alphabet. It is the set of all words over the alphabet.
or more formally:
- , .
- , .
Kleene Plus
The Kleene plus is a variation of the Kleene star that does not include the empty word. It is the set of all words of length greater than zero that can be made from the alphabet.
- , .
- , .
Subwords
A subword of a word is a word that can be obtained by deleting zero or more symbols from the original word. Subwords are also called substrings or infixes. The word is a subword of the word if there are words and in the same alphabet such that:
or more formally:
Because and can be the empty word, every word is a subword of itself. This also leads to the empty word being a subword of every word as well because we can always choose to be the empty word and to be the word itself, or vice versa, or some split in between.
A proper subword is a subword that is not the same as the original word.
This also means that either or or both are not the empty word and that .
- , is a subword of .
- , is a proper subword of .
- , is a proper subword of .
- , is a proper subword of .
- , is not a subword of .
Prefix
A prefix of a word is a subword that starts at the beginning of the word. This means that the prefix of a word is a subword of such that there is a word in the same alphabet such that:
Because the prefix is a special case of a subword it follows all the same rules as a subword and also has the differenation between normal and proper prefixes.
- , is a proper prefix of .
- , is a prefix of .
- , is a proper prefix of .
- , is a proper prefix of .
- , is not a prefix of .
Suffix
A suffix of a word is a subword that is at the end of the word. This means that the suffix of a word is a subword of such that there is a word in the same alphabet such that:
or more formally:
Because the suffix is a special case of a subword it follows all the same rules as a subword and also has the differenation between normal and proper suffixes.
- , is a suffix of .
- , is a suffix of .
- , is a proper suffix of .
- , is a proper suffix of .
- , is not a suffix of .
Languages
A language is a specific set of words over an alphabet commonly denoted by the symbol . More precisly a language is a subset of the set of all words over an alphabet.
We can define the english language as a language over the alphabet including all the letters of the English alphabet and some punctuation marks. The language would then include all the words that are valid in the English language.
Programming languages (such as C) are languages based on the ASCII character set, where the programs must have a correct syntax.
The language of all binary numbers with the same number of 0s and 1s is a language over the alphabet .
Sub and Supersets of Languages
If is a language over an alphabet and is a superset of , then is obviously also a language over :
The empty set is a language over every alphabet because it is a subset of every set of words. It is often denoted by the symbol or just .
The language that contains only the empty word is also a language over every alphabet. It is often denoted by the symbol or just .
However, this does not mean that the empty set and the set containing only the empty word are the same.
Operations on Languages
Because languages are just sets we can apply the common set operations to them. This includes union, intersection, difference, and complement. These are all self explanatory and work the way as expected. However, when the operation is binary, i.e. it takes two languages as input, the two languages must be over the same alphabet.
- Union:
- Intersection:
- Difference:
- Complement: or i.e the set of all words over the alphabet that are not in the language.
In addition to these set operations, we can also apply the operations of concatenation and the Kleene star to languages. These operations are defined in the same way as for words.
Concatenation
When concatenating two languages, we concatenate every word in the first language with every word in the second language.
This is also called the product of two languages. The concatenation of two languages is very similar to the cartesian product of two sets. This is clearly visible in the example below.
- , , .
- , , .
Kleene Star
Using the concatenation operation we can also define the Kleene star of a language. The Kleene star of a language is the set of all words that can be made by concatenating any number of words from the original languages. For this we define the following:
We can then define the Kleene star of a language as:
We can also define the Kleene plus, i.e the set of all words that can be made by concatenating any number of words from the original language except the empty word:
- , .
- , .