Regular vs. Context-free Languages

Typical regular languages are
- \(\{ a^{3n} : n\in\mathbb{N} \}\)
- strings with an even number of \(a\)’s
- strings containing (or not containing) \(aba\) as a substring
- etc.
Typical context-free languages are
- \(\{ a^nb^n : n\in\mathbb{N} \}\)
- palindromes
- \(\{ ww^R : w\in\Sigma^* \}\)
- balanced parentheses
- strings with equal numbers of \(a\)’s and \(b\)’s
- arithmetic expressions
- etc.

Formal Definition

A Context-free Grammar (CFG) is a 4-tuple \((V, \Sigma, R, S)\) where
- \(V\) is a nonempty finite set of variables (or nonterminals)
- \(\Sigma\) is a finite alphabet disjoint from \(V\); members of the alphabet are called terminals
- \(R\), the set of rules (or productions), is a finite set of ordered pairs \((A, \alpha)\), where \(A\in V\) and \(\alpha\in (V\cup\Sigma)^*\)
- \(S\in V\) is a special variable called the start variable
A rule \((A, \alpha)\) is usually written \(A\to \alpha\).

Here is a simple CFG example. Let
- \(V = \{S\}\)
- \(\Sigma = \{\,(, )\,\}\)
- \(R = \{ S\to (S),\ S\to SS,\ S\to\varepsilon \}\)
- the start variable be \(S\)
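The 4-tuple definition can be written down directly as data. The following is a minimal sketch, assuming symbols are one-character Python strings and a rule \((A,\alpha)\) is a pair whose second component is a tuple of symbols; the names `CFG`, `is_well_formed`, and `PARENS` are illustrative only and do not come from the definition itself.

```python
from typing import NamedTuple


class CFG(NamedTuple):
    """A context-free grammar (V, Sigma, R, S) stored as plain Python sets."""
    variables: frozenset   # V
    terminals: frozenset   # Sigma
    rules: frozenset       # R: pairs (A, alpha), alpha a tuple of symbols
    start: str             # S


def is_well_formed(g: CFG) -> bool:
    """Check the side conditions stated in the formal definition."""
    symbols = g.variables | g.terminals
    return (
        len(g.variables) > 0                        # V is nonempty
        and g.variables.isdisjoint(g.terminals)     # Sigma is disjoint from V
        and g.start in g.variables                  # S is a variable
        and all(
            lhs in g.variables and all(s in symbols for s in rhs)
            for lhs, rhs in g.rules                 # A in V, alpha in (V u Sigma)*
        )
    )


# The example grammar: S -> (S), S -> SS, S -> epsilon (the empty tuple).
PARENS = CFG(
    variables=frozenset({"S"}),
    terminals=frozenset({"(", ")"}),
    rules=frozenset({("S", ("(", "S", ")")), ("S", ("S", "S")), ("S", ())}),
    start="S",
)

print(is_well_formed(PARENS))
```

Finiteness of the three sets is automatic here, since Python sets built this way are finite.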

Instead of specifying all the components \(V\), \(\Sigma\), \(R\), and \(S\) of a CFG, we usually give just the rules, with these understood assumptions:
- uppercase letters are variables
- lowercase letters and special characters are terminals
- the LHS of the first given rule is assumed to be the start variable
- when more than one rule has the same LHS, their RHSs can be written on one line, separated by the | symbol.
Example. The previous CFG is usually given as

\(S\to (S)\ \mid\ SS\ \mid\ \varepsilon\)
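As an illustration of these conventions, the sketch below expands the one-line notation back into individual rules. It assumes an ASCII arrow `->` in place of \(\to\), the character `ε` for the empty string, and one LHS per input line; the function name `parse_rules` is hypothetical.

```python
def parse_rules(text):
    """Expand shorthand such as 'S -> (S) | SS | ε' into (start, rules).

    `rules` maps each LHS to the list of its RHS strings, with the empty
    string standing for ε. The LHS of the first line is the start variable.
    """
    rules = {}
    start = None
    for line in text.strip().splitlines():
        if not line.strip():
            continue                      # skip blank lines
        lhs, alternatives = line.split("->")
        lhs = lhs.strip()
        if start is None:
            start = lhs                   # first given rule fixes the start variable
        for alt in alternatives.split("|"):
            alt = alt.strip().replace(" ", "")
            rules.setdefault(lhs, []).append("" if alt == "ε" else alt)
    return start, rules


start, rules = parse_rules("S -> (S) | SS | ε")
print(start, rules)
```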

A CFG is a string-rewriting system. One starts by writing down the start variable; then, at each step, one chooses some variable appearing in the current string and replaces it by the RHS of some rule whose LHS is that variable.
Formally, for strings \(\alpha,\gamma\in (V\cup\Sigma)^*\) and rule \(B\to \beta\), we write \(\alpha B\gamma\Rightarrow \alpha\beta\gamma\) and we say that \(\alpha B\gamma\) yields \(\alpha\beta\gamma\).
We write \(\alpha\Rightarrow^* \beta\), and say that \(\alpha\) derives \(\beta\) if \(\beta\) can be obtained from \(\alpha\) in zero or more yield steps.
Example. Using our example grammar, \(S\Rightarrow SS\Rightarrow (S)S\Rightarrow ()S\) shows that \(S\Rightarrow^* ()S\).
Trivially, \(\alpha\Rightarrow^* \alpha\) for any string \(\alpha\).
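The yield relation is easy to compute mechanically. Here is a small sketch, assuming each symbol is a single character and the grammar is stored as a dictionary from variables to lists of RHS strings (with `""` for \(\varepsilon\)); the helper names `yields` and `is_derivation` are illustrative.

```python
# Example grammar S -> (S) | SS | ε, with "" standing for ε.
RULES = {"S": ["(S)", "SS", ""]}


def yields(form, rules):
    """Return every string obtainable from `form` in exactly one yield step."""
    results = set()
    for i, symbol in enumerate(form):
        for rhs in rules.get(symbol, []):   # terminals have no rules
            results.add(form[:i] + rhs + form[i + 1:])
    return results


def is_derivation(forms, rules):
    """Check that each string in the list yields the next one."""
    return all(b in yields(a, rules) for a, b in zip(forms, forms[1:]))


print(is_derivation(["S", "SS", "(S)S", "()S"], RULES))
```

A list with a single string counts as a derivation with zero steps, matching the remark that every string derives itself.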

The language generated by the CFG \(G=(V,\Sigma,R,S)\), written \(L(G)\), is the set of all strings over \(\Sigma\) that can be derived from \(S\), that is, \(L(G) = \{ w\in\Sigma^* : S\Rightarrow^* w \}\).
A Context-free Language (CFL) is one that can be generated by some CFG.
Example. Some CFLs and CFGs that generate them follow.
- Balanced parentheses: \[ S\to SS \ \mid\ (S) \ \mid\ \varepsilon \]
- \(\{a^nb^n : n\in\mathbb{N}\}\): \[ S\to aSb \ \mid\ \varepsilon \]
- Palindromes: \[ S\to aSa \ \mid\ bSb \ \mid\ a \ \mid\ b \ \mid\ \varepsilon \]
- Equal numbers of \(a\)’s and \(b\)’s: \[ S\to aSbS \ \mid\ bSaS \ \mid\ \varepsilon \]
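One way to sanity-check such grammars is to enumerate the short strings they generate. The sketch below does a breadth-first search over leftmost derivations and discards any intermediate string with too many terminals or too many variables. The cap on variables is an assumption that happens to be safe for the three grammars used here; it is not a general algorithm (for instance, it may miss strings for grammars whose derivations need many variables that later vanish). All names are illustrative.

```python
from collections import deque

ANBN = {"S": ["aSb", ""]}
PALINDROMES = {"S": ["aSa", "bSb", "a", "b", ""]}
EQUAL = {"S": ["aSbS", "bSaS", ""]}


def generate(rules, start, max_len):
    """Collect the terminal strings of length <= max_len derivable from start."""
    found = set()
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Position of the leftmost variable, or None if the string is all terminals.
        idx = next((i for i, s in enumerate(form) if s in rules), None)
        if idx is None:
            found.add(form)
            continue
        for rhs in rules[form[idx]]:
            new = form[:idx] + rhs + form[idx + 1:]
            n_vars = sum(s in rules for s in new)
            # Terminals never disappear, so too many of them is a dead end.
            # The bound on variables is the heuristic cap described above.
            if len(new) - n_vars > max_len or n_vars > max_len + 1:
                continue
            if new not in seen:
                seen.add(new)
                queue.append(new)
    return found


print(sorted(generate(ANBN, "S", 6), key=lambda w: (len(w), w)))
```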

A parse tree is a rooted, ordered, node-labeled tree with these properties.
- The root node is labeled with the start variable.
- Every internal node is labeled with a variable.
- Every leaf node is labeled with a terminal or \(\varepsilon\).
- For any internal node, say one with label \(A\), if the labels of its children are read from left to right, then we’ll get a string, say \(\alpha\), such that \(A\to\alpha\) is a grammar rule.
The yield of a parse tree is the string obtained from concatenating the labels of all its leaves from left to right.
Using the grammar for the language of balanced parentheses, the string (()) is the yield of the following parse tree.
[Figure: a parse tree with yield (())]
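The parse-tree conditions and the yield can be sketched in code. The tree built below is one possible parse tree for (()), chosen for illustration; it is not necessarily the tree in the original figure. The representation (a node is a pair of a label and a list of children, with the label `"ε"` marking an empty leaf) and the function names are assumptions of this sketch.

```python
RULES = {"S": ["(S)", "SS", ""]}   # "" stands for ε


def leaf(label):
    """A leaf node: a terminal or the special label 'ε'."""
    return (label, [])


def tree_yield(node):
    """Concatenate the leaf labels from left to right (ε contributes nothing)."""
    label, children = node
    if not children:
        return "" if label == "ε" else label
    return "".join(tree_yield(child) for child in children)


def is_parse_tree(node, rules, start):
    """Check the root label and the rule condition at every internal node."""
    return node[0] == start and _check(node, rules)


def _check(node, rules):
    label, children = node
    if not children:
        return label not in rules          # leaves are terminals or ε, never variables
    if label not in rules:
        return False                       # internal nodes must be variables
    child_labels = [child[0] for child in children]
    if "ε" in child_labels and len(children) != 1:
        return False                       # an ε leaf must be an only child
    rhs = "".join("" if c == "ε" else c for c in child_labels)
    return rhs in rules[label] and all(_check(child, rules) for child in children)


# S -> (S), whose inner S -> (S), whose inner S -> ε.
inner = ("S", [leaf("ε")])
middle = ("S", [leaf("("), inner, leaf(")")])
TREE = ("S", [leaf("("), middle, leaf(")")])

print(is_parse_tree(TREE, RULES, "S"), tree_yield(TREE))
```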

A leftmost derivation is a derivation in which every yield step replaces the leftmost variable with the RHS of some rule for that variable.
E.g., using the grammar for the balanced parentheses, \[ S \Rightarrow SS \Rightarrow (S)S \Rightarrow ()S \Rightarrow () \] is a leftmost derivation of () from \(S\).
A rightmost derivation is a derivation in which every yield step replaces the rightmost variable with the RHS of some rule for that variable.
The following is a rightmost derivation of the same string () from \(S\): \[ S \Rightarrow SS \Rightarrow S(S) \Rightarrow S() \Rightarrow () \]
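The two derivations above can be replayed step by step. In this sketch the same sequence of rule choices (SS, then (S), then \(\varepsilon\) twice) is applied once to the leftmost and once to the rightmost variable; the function names and the dictionary encoding of the grammar are assumptions made for illustration.

```python
RULES = {"S": ["(S)", "SS", ""]}   # "" stands for ε


def replace_leftmost(form, rhs):
    """Replace the leftmost variable in `form` by `rhs`."""
    i = min(i for i, s in enumerate(form) if s in RULES)
    assert rhs in RULES[form[i]], "rhs must belong to a rule for that variable"
    return form[:i] + rhs + form[i + 1:]


def replace_rightmost(form, rhs):
    """Replace the rightmost variable in `form` by `rhs`."""
    i = max(i for i, s in enumerate(form) if s in RULES)
    assert rhs in RULES[form[i]], "rhs must belong to a rule for that variable"
    return form[:i] + rhs + form[i + 1:]


def derive(start, choices, step):
    """Apply the chosen RHSs in order and return all intermediate strings."""
    forms = [start]
    for rhs in choices:
        forms.append(step(forms[-1], rhs))
    return forms


CHOICES = ["SS", "(S)", "", ""]
print(" => ".join(derive("S", CHOICES, replace_leftmost)))
print(" => ".join(derive("S", CHOICES, replace_rightmost)))
```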

A string of terminals is ambiguous if it is the yield of at least two distinct parse trees; equivalently, if it has at least two distinct leftmost (or at least two distinct rightmost) derivations from the start variable.
Exercise. Show that the string () is ambiguous.
Answer.
\[ S\Rightarrow (S) \Rightarrow () \] \[ S\Rightarrow SS \Rightarrow (S)S \Rightarrow ()S\Rightarrow () \] are two distinct leftmost derivations of ().
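Distinct leftmost derivations can also be found by brute force. The sketch below enumerates leftmost derivations of a target string up to a step bound; the bound is needed because this grammar admits arbitrarily long derivations of the same string (for example by applying \(S\to SS\) and then erasing one copy). The function name and the bound of 4 steps are choices made for this illustration.

```python
RULES = {"S": ["(S)", "SS", ""]}   # "" stands for ε


def leftmost_derivations(rules, start, target, max_steps):
    """Return all leftmost derivations of `target` using at most max_steps steps."""
    found = []

    def extend(path):
        form = path[-1]
        idx = next((i for i, s in enumerate(form) if s in rules), None)
        if idx is None:                      # no variables left
            if form == target:
                found.append(path)
            return
        if len(path) - 1 == max_steps:       # step budget exhausted
            return
        if not target.startswith(form[:idx]):
            return                           # terminal prefix already disagrees
        for rhs in rules[form[idx]]:
            extend(path + [form[:idx] + rhs + form[idx + 1:]])

    extend([start])
    return found


for derivation in leftmost_derivations(RULES, "S", "()", 4):
    print(" => ".join(derivation))
```

Both derivations given in the answer appear in the output, which is enough to conclude that () is ambiguous in this grammar.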
A grammar is ambiguous if its start variable derives some ambiguous string of terminals.
Exercise. Give an unambiguous grammar for the language of balanced parentheses.
Answer. \(S\to (S)S\ \mid\ \varepsilon\)
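As a spot check of this answer (not a proof), one can count the parse trees of a given string. The sketch below counts, for every split of the input, the ways a sequence of symbols can derive a substring. It assumes that every non-\(\varepsilon\) rule starts with a terminal, which holds for \(S\to (S)S \mid \varepsilon\) and guarantees termination; it would loop forever on a rule such as \(S\to SS\). The function name is hypothetical.

```python
from functools import lru_cache

RULES = {"S": ["(S)S", ""]}   # "" stands for ε


def count_parse_trees(rules, start, text):
    """Count the parse trees with root `start` and yield `text`."""

    @lru_cache(maxsize=None)
    def count_seq(alpha, i, j):
        """Number of ways the symbol string alpha derives text[i:j]."""
        if not alpha:
            return 1 if i == j else 0
        head, rest = alpha[0], alpha[1:]
        if head not in rules:                       # head is a terminal
            if i < j and text[i] == head:
                return count_seq(rest, i + 1, j)
            return 0
        total = 0
        for k in range(i, j + 1):                   # head derives text[i:k]
            left = count_var(head, i, k)
            if left:
                total += left * count_seq(rest, k, j)
        return total

    @lru_cache(maxsize=None)
    def count_var(var, i, j):
        """Number of parse trees rooted at `var` with yield text[i:j]."""
        return sum(count_seq(rhs, i, j) for rhs in rules[var])

    return count_var(start, 0, len(text))


for w in ["", "()", "(())", "()()", "(()())()", ")(", "(()"]:
    print(repr(w), count_parse_trees(RULES, "S", w))
```

Every balanced string tried has exactly one parse tree and every unbalanced one has none, consistent with the grammar being unambiguous.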
A CFL is inherently ambiguous if every CFG that generates it is ambiguous.

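A right-linear grammar is a CFG in which the RHS of every rule contains at most one variable, and that variable, if present, is the last symbol.
For example, the right-linear grammar \[ S\to aS \ \mid\ baS \ \mid\ b \ \mid\ \varepsilon \] generates all strings over \(\{a,b\}\) without two consecutive \(b\)’s.
A left-linear grammar is defined similarly, with the variable, if present, as the first symbol of the RHS.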
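A regular grammar is one that is either a right-linear grammar or a left-linear grammar. (The broader term linear grammar usually allows the single variable anywhere in the RHS; for instance, \(S\to aSb \ \mid\ \varepsilon\) is linear in that sense, yet it generates the non-regular language \(\{a^nb^n : n\in\mathbb{N}\}\).)
Theorem. A language \(L\) is regular if and only if it is generated by some regular grammar.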
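One direction of the theorem can be made concrete: a right-linear grammar can be run like a nondeterministic finite automaton whose states are the variables. The sketch below is an illustration under that reading, not a proof; it tracks pairs (variable, input position) and is checked against the description "no two consecutive \(b\)’s" on all short strings. The function names are hypothetical.

```python
from itertools import product

# Right-linear grammar S -> aS | baS | b | ε, with "" standing for ε.
RULES = {"S": ["aS", "baS", "b", ""]}


def accepts(rules, start, text):
    """Decide membership for a right-linear grammar by searching (variable, position) pairs."""
    seen = set()
    stack = [(start, 0)]
    while stack:
        var, pos = stack.pop()
        if (var, pos) in seen:
            continue
        seen.add((var, pos))
        for rhs in rules[var]:
            # Split the RHS into its terminal prefix and optional trailing variable.
            if rhs and rhs[-1] in rules:
                prefix, nxt = rhs[:-1], rhs[-1]
            else:
                prefix, nxt = rhs, None
            if not text.startswith(prefix, pos):
                continue
            end = pos + len(prefix)
            if nxt is None:
                if end == len(text):      # rule without a variable must finish the input
                    return True
            else:
                stack.append((nxt, end))  # move to the next "state"
    return False


def all_strings(alphabet, max_len):
    """All strings over `alphabet` of length at most max_len."""
    return ["".join(t) for n in range(max_len + 1) for t in product(alphabet, repeat=n)]


print(all(accepts(RULES, "S", s) == ("bb" not in s) for s in all_strings("ab", 6)))
```

Since only finitely many (variable, position) pairs exist for a given input, the search always terminates.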