Chomsky Normal Form (CNF)

A context-free grammar \(G = (V, \Sigma, R, S)\) is in Chomsky Normal Form (CNF) if every one of its rules has one of these forms:
- \(A\to BC\), for some \(A\in V\) and some \(B,C \in V\setminus \{S\}\)
- \(A\to a\), for some \(A\in V\) and some \(a\in\Sigma\)
- \(S\to \varepsilon\)
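The three rule forms above can be checked mechanically. The following is a minimal sketch, assuming a representation that is not part of the slides: a rule is a pair `(head, body)`, `body` is a tuple of symbols, and the empty tuple `()` stands for \(\varepsilon\). The name `is_cnf` is hypothetical.

```python
def is_cnf(rules, variables, start):
    """Return True if every rule (head, body) has one of the three CNF forms.

    Assumed representation: body is a tuple of symbols, () stands for epsilon.
    """
    for head, body in rules:
        if len(body) == 2:
            # A -> BC, with B and C variables other than the start variable
            ok = all(x in variables and x != start for x in body)
        elif len(body) == 1:
            # A -> a, with a a terminal
            ok = body[0] not in variables
        else:
            # the only other allowed shape is S -> epsilon
            ok = body == () and head == start
        if not ok:
            return False
    return True


variables = {"S", "A", "B"}
good = [("S", ("A", "B")), ("A", ("a",)), ("B", ("b",)), ("S", ())]
bad = [("S", ("A", "S", "A")), ("A", ("B",))]
print(is_cnf(good, variables, "S"), is_cnf(bad, variables, "S"))
```

The second grammar fails because it has a right side of length 3 and a unit rule.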

CNF is completely general

Theorem. Every CFG has an equivalent CFG in CNF.
Proof Idea. We convert the grammar step by step, each time into an equivalent grammar. Each step eliminates some violations of the CNF rule forms, and none remains at the end.

We’ll explain the conversion algorithm by using it to convert to CNF the CFG having these rules:

\(S \to ASA \ |\ aB\)
\(A \to B \ |\ S\)
\(B \to b \ |\ \varepsilon\)

Step 1: TERM

In Step 1, we make sure that if there is a rule \(A\to\alpha\) where \(\alpha \in (V\cup\Sigma)^*\) and \(|\alpha| \ge 2\), then \(\alpha\) contains no terminal.
For every symbol \(a\in\Sigma\) that occurs on the right side of some such rule, we do the following:

- introduce a new variable \(U_a\)
- replace every occurrence of \(a\) on every such right side by \(U_a\)
- add a rule \(U_a\to a\)

Applying this step to the given grammar (shown first) yields this equivalent grammar (shown second), where the new variable \(U_a\) is written \(U\) for short

\(S \to ASA \ |\ aB\)

\(A \to B \ |\ S\)

\(B \to b \ |\ \varepsilon\)

\(S \to ASA \ |\ U B\)

\(A \to B \ |\ S\)

\(B \to b \ |\ \varepsilon\)

\(U \to a\)
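Step 1 can be sketched in code as follows. This is only an illustration under assumptions of our own: rules are `(head, body)` pairs with `body` a tuple of symbols and `()` for \(\varepsilon\), the function name `term` is hypothetical, and the new variable for terminal `a` is named `"U_a"`, which is assumed not to clash with an existing variable.

```python
def term(rules, variables):
    """Replace terminals inside right sides of length >= 2 by new variables."""
    new_rules, new_vars = set(), set(variables)
    for head, body in rules:
        if len(body) >= 2:
            fixed = []
            for x in body:
                if x not in variables:      # x is a terminal
                    u = "U_" + x            # assumed fresh variable name
                    new_vars.add(u)
                    new_rules.add((u, (x,)))  # add the rule U_a -> a
                    x = u
                fixed.append(x)
            body = tuple(fixed)
        new_rules.add((head, body))
    return new_rules, new_vars


variables = {"S", "A", "B"}
rules = {
    ("S", ("A", "S", "A")), ("S", ("a", "B")),
    ("A", ("B",)), ("A", ("S",)),
    ("B", ("b",)), ("B", ()),
}
new_rules, new_vars = term(rules, variables)
for rule in sorted(new_rules):
    print(rule)
```

On the example grammar only the rule \(S\to aB\) changes. The terminal \(b\) in \(B\to b\) is left alone because that right side has length 1.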

Step 2: BIN

In Step 2, we eliminate all rules whose right side has length greater than 2. We replace any rule of the form \(A\to B_1B_2\dots B_k\), where \(k>2\), by these \(k-1\) rules

\(A\to B_1C_1\)
\(C_1\to B_2C_2\)
\(\qquad\vdots\)
\(C_{k-2}\to B_{k-1}B_k\)

(Note that \(C_1,\dots,C_{k-2}\) are new variables.)

Applying this step to the grammar from the previous step (shown first) yields this equivalent grammar (shown second), with \(C\) as the one new variable

\(S \to ASA \ |\ U B\)

\(A \to B \ |\ S\)

\(B \to b \ |\ \varepsilon\)

\(U \to a\)

\(S \to AC \ |\ U B\)

\(C \to SA\)

\(A \to B \ |\ S\)

\(B \to b \ |\ \varepsilon\)

\(U \to a\)
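A possible sketch of Step 2, under the same assumed representation (rules as `(head, body)` pairs, `body` a tuple). The function name `binarize` and the naming scheme `C1, C2, ...` for the new variables are our own choices, not part of the slides.

```python
def binarize(rules, variables):
    """Split every right side of length k > 2 into a chain of k-1 rules."""
    new_rules, new_vars = set(), set(variables)
    counter = 0
    for head, body in sorted(rules):        # sorted only to make naming repeatable
        while len(body) > 2:
            counter += 1
            while f"C{counter}" in new_vars:  # skip names already in use
                counter += 1
            c = f"C{counter}"
            new_vars.add(c)
            new_rules.add((head, (body[0], c)))  # A -> B1 C1
            head, body = c, body[1:]             # continue with C1 -> B2 ... Bk
        new_rules.add((head, body))
    return new_rules, new_vars


variables = {"S", "A", "B", "U"}
rules = {
    ("S", ("A", "S", "A")), ("S", ("U", "B")),
    ("A", ("B",)), ("A", ("S",)),
    ("B", ("b",)), ("B", ()),
    ("U", ("a",)),
}
new_rules, new_vars = binarize(rules, variables)
for rule in sorted(new_rules):
    print(rule)
```

Here the new variable is called `C1` instead of \(C\). A right side of length \(k\) produces \(k-2\) new variables.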

Step 3: START

In Step 3, we eliminate all rules having the start variable on the right side. We first introduce a new variable \(S'\), then change all occurrences of \(S\) anywhere in any rule to \(S'\), and finally add a new rule \(S\to S'\).

Applying this step to our grammar from the previous step (shown first), we get the new grammar (shown second)

\(S \to AC \ |\ U B\)

\(C \to SA\)

\(A \to B \ |\ S\)

\(B \to b \ |\ \varepsilon\)

\(U \to a\)

\(S \to S'\)

\(S' \to AC \ |\ U B\)

\(C \to S'A\)

\(A \to B \ |\ S'\)

\(B \to b \ |\ \varepsilon\)

\(U \to a\)
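Step 3 is a renaming, which the following sketch makes explicit. The representation (rules as `(head, body)` pairs with tuple bodies) and the function name `add_start` are assumptions for illustration. The new variable is named by appending a prime to the start variable.

```python
def add_start(rules, variables, start):
    """Rename the start variable to a new variable everywhere, then add S -> S'."""
    fresh = start + "'"
    while fresh in variables:               # make sure the new name is unused
        fresh += "'"

    def rename(x):
        return fresh if x == start else x

    new_rules = {(rename(h), tuple(rename(x) for x in b)) for h, b in rules}
    new_rules.add((start, (fresh,)))        # the new rule S -> S'
    return new_rules, set(variables) | {fresh}


variables = {"S", "A", "B", "C", "U"}
rules = {
    ("S", ("A", "C")), ("S", ("U", "B")),
    ("C", ("S", "A")),
    ("A", ("B",)), ("A", ("S",)),
    ("B", ("b",)), ("B", ()),
    ("U", ("a",)),
}
new_rules, new_vars = add_start(rules, variables, "S")
for rule in sorted(new_rules):
    print(rule)
```

After this step the start variable appears on the right side of no rule.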

Step 4: DEL

A variable \(A\) is called nullable if \(A\Rightarrow^*\varepsilon\).

An \(\varepsilon\)-rule is a rule that has \(\varepsilon\) as its right side.

Step 4 has a number of substeps, whose purpose is to remove all forbidden \(\varepsilon\)-rules.

Substep 4.1: we find all nullable variables using this algorithm.

N := {A : the grammar has a rule A → ε};
while (there exists a rule A → B with B in N
  but A not in N, or there exists a rule
  A → BC with both B,C in N but A not in N)
do {
  add A to N;
}
return N; // N is the set of all nullable variables
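The loop above translates almost line by line into Python. This is a sketch under an assumed representation (rules as `(head, body)` pairs, `body` a tuple, `()` for \(\varepsilon\)); the function name `nullable_variables` is hypothetical.

```python
def nullable_variables(rules):
    """Return the set N of all nullable variables."""
    rules = list(rules)
    # N starts with every A that has a rule A -> epsilon.
    nullable = {head for head, body in rules if body == ()}
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            # A -> B or A -> BC with every right-side symbol already in N
            if head not in nullable and all(x in nullable for x in body):
                nullable.add(head)
                changed = True
    return nullable


rules = {
    ("S", ("S'",)),
    ("S'", ("A", "C")), ("S'", ("U", "B")),
    ("C", ("S'", "A")),
    ("A", ("B",)), ("A", ("S'",)),
    ("B", ("b",)), ("B", ()),
    ("U", ("a",)),
}
print(sorted(nullable_variables(rules)))
```

Terminals are never added to the set, so a right side containing a terminal can never make its left side nullable.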

Step 4 continued

4.2. For each rule whose right side has length 2 and has exactly one nullable variable \(A\), say \(X\to AY\) or \(X\to YA\), we add a rule \(X\to Y\) (unless \(X=Y\)).
4.3. For each rule \(X\to AB\) where both \(A\) and \(B\) are nullable, we add rules \(X\to A\) (unless \(X=A\)) and \(X\to B\) (unless \(X=B\)). (Note that \(A\) and \(B\) may very well be the same. In that case, we need only add \(X\to A\) once.)
4.4. Remove all \(\varepsilon\)-rules.
4.5. Add rule \(S\to\varepsilon\) if \(S\) is nullable.

Step 4 continued

In the grammar from the previous step (shown first), only \(A\) and \(B\) are nullable. Applying steps 4.2–4.5 to it results in the equivalent grammar shown second.

\(S \to S'\)

\(S' \to AC \ |\ U B\)

\(C \to S'A\)

\(A \to B \ |\ S'\)

\(B \to b \ |\ \varepsilon\)

\(U \to a\)

\(S \to S'\)

\(S' \to AC\ |\ C \ |\ U B \ |\ U\)

\(C \to S'A \ |\ S'\)

\(A \to B \ |\ S'\)

\(B \to b\)

\(U \to a\)
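The whole of Step 4 can be sketched as below. It assumes the grammar has already been through Steps 1 to 3, so every right side has length at most 2. The representation (rules as `(head, body)` pairs, `()` for \(\varepsilon\)) and the function names are our own assumptions.

```python
def nullable_variables(rules):
    """Fixed-point computation of the nullable variables."""
    nullable = {h for h, b in rules if b == ()}
    changed = True
    while changed:
        changed = False
        for h, b in rules:
            if h not in nullable and all(x in nullable for x in b):
                nullable.add(h)
                changed = True
    return nullable


def delete_epsilon(rules, start):
    """Remove epsilon-rules, compensating with shortened copies of rules."""
    rules = set(rules)
    nullable = nullable_variables(rules)
    new_rules = set()
    for head, body in rules:
        if body == ():
            continue                         # drop every epsilon-rule
        new_rules.add((head, body))
        if len(body) == 2:
            left, right = body
            if left in nullable and right != head:
                new_rules.add((head, (right,)))  # left symbol may vanish
            if right in nullable and left != head:
                new_rules.add((head, (left,)))   # right symbol may vanish
    if start in nullable:
        new_rules.add((start, ()))           # keep epsilon in the language
    return new_rules


rules = {
    ("S", ("S'",)),
    ("S'", ("A", "C")), ("S'", ("U", "B")),
    ("C", ("S'", "A")),
    ("A", ("B",)), ("A", ("S'",)),
    ("B", ("b",)), ("B", ()),
    ("U", ("a",)),
}
new_rules = delete_epsilon(rules, "S")
for rule in sorted(new_rules):
    print(rule)
```

The two `if` tests inside the loop cover both the case of exactly one nullable variable and the case of two, including the exceptions \(X=Y\), \(X=A\), and \(X=B\).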

Step 5: UNIT

In Step 5, we eliminate all unit rules, i.e., those of the form \(A\to B\) where \(B\) is also a variable.
First, delete every rule of the form \(A\to A\) as they do not contribute anything to the generated language. (In fact, any time during our whole algorithm if we discover any such rule, we should delete it from the grammar.)
Next, create a directed graph showing all the unit rules. The vertices of the graph are the variables participating in some unit rule. There’s an edge from vertex \(A\) to vertex \(B\) if and only if the grammar has a unit rule \(A\to B\).
The unit rules graph of our current grammar, shown below, has the edges \(S\to S'\), \(S'\to C\), \(S'\to U\), \(C\to S'\), \(A\to B\), and \(A\to S'\).

\(S \to S'\)

\(S' \to AC\ |\ C \ |\ U B \ |\ U\)

\(C \to S'A \ |\ S'\)

\(A \to B \ |\ S'\)

\(B \to b\)

\(U \to a\)

Step 5 continued

Step 5 has two substeps:
- The first substep gets rid of all cycles in the graph.
- The second substep deletes vertices (and edges) from the graph.

The first substep removes all cycles from the graph and at the same time alters the grammar to maintain equivalence.

    while (the graph has some cycle) do {
      choose a cycle C;
      let X be an arbitrary vertex in C;
      for (each rule) do {  
        if (any vertex Y of cycle C appears in the rule  
            and Y is not equal to X)
        then {
          change every occurrence of Y in the rule to X;  
        }
      }  
      contract the cycle C;  
    }

Step 5 continued

\(S \to S'\)

\(S' \to AC\ |\ C \ |\ U B \ |\ U\)

\(C \to S'A \ |\ S'\)

\(A \to B \ |\ S'\)

\(B \to b\)

\(U \to a\)

Our graph has one cycle \(S' \to C \to S'\). Choosing \(C\) as the name of the vertex that the cycle contracts to, we change every occurrence of \(S'\) in any rule to \(C\), delete the resulting rules \(C\to C\), then contract the cycle. The result is this grammar, whose graph has the edges \(S\to C\), \(C\to U\), \(A\to B\), and \(A\to C\):

\(S \to C\)

\(C \to AC \ |\ U B \ |\ U \ |\ CA\)

\(A \to B \ |\ C\)

\(B \to b\)

\(U \to a\)

The graph no longer has any cycle. We move on to the next substep.
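The first substep can be sketched in code. Note that this sketch swaps in a slightly different technique: instead of contracting one cycle at a time, it merges every group of mutually reachable variables in one pass, which is what repeated contraction ends up doing. The representation, the function name `contract_unit_cycles`, and the choice of the alphabetically smallest name as the surviving vertex are all assumptions of ours.

```python
def contract_unit_cycles(rules, variables):
    """Merge variables that lie on cycles of unit rules."""
    # Drop rules A -> A first; they contribute nothing.
    rules = {(h, b) for h, b in rules if b != (h,)}
    # Edges of the unit rules graph.
    edges = {(h, b[0]) for h, b in rules if len(b) == 1 and b[0] in variables}
    # reach[v]: variables reachable from v along one or more edges.
    reach = {v: {b for a, b in edges if a == v} for v in variables}
    changed = True
    while changed:
        changed = False
        for v in variables:
            extra = set().union(*(reach[w] for w in reach[v])) - reach[v]
            if extra:
                reach[v] |= extra
                changed = True
    # Variables that reach each other get one common representative.
    rep = {v: min({w for w in reach[v] if v in reach[w]} | {v}) for v in variables}

    def rename(x):
        return rep.get(x, x)                # terminals are left unchanged

    merged = {(rename(h), tuple(rename(x) for x in b)) for h, b in rules}
    # Renaming can create rules A -> A; delete them again.
    return {(h, b) for h, b in merged if b != (h,)}


variables = {"S", "S'", "C", "A", "B", "U"}
rules = {
    ("S", ("S'",)),
    ("S'", ("A", "C")), ("S'", ("C",)), ("S'", ("U", "B")), ("S'", ("U",)),
    ("C", ("S'", "A")), ("C", ("S'",)),
    ("A", ("B",)), ("A", ("S'",)),
    ("B", ("b",)),
    ("U", ("a",)),
}
new_rules = contract_unit_cycles(rules, variables)
for rule in sorted(new_rules):
    print(rule)
```

Because Step 3 removed the start variable from all right sides, the start variable has no entering edge and is never merged away.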

Step 5 continued

The second substep deletes all edges of the graph and at the same time alters the grammar to maintain equivalence.

while (the graph has some vertex with entering edge) do {
  let X be a vertex with some entering edge but no leaving edge;  
  delete X and all its entering edges from the graph;
  for (each unit rule of the form A → X) do {
    delete the rule A → X from the grammar;
    for (each rule X → α) do
      add rule A → α to the grammar;
  }
}
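The loop above can be sketched in Python as follows. The sketch assumes the unit rules graph is already acyclic, and it recomputes the edges from the grammar instead of keeping a separate graph; this is safe because the deleted vertex has no leaving edge, so no new unit rule is ever added. The representation and the function name `delete_unit_rules` are assumptions for illustration.

```python
def delete_unit_rules(rules, variables):
    """Eliminate unit rules, assuming the unit rules graph has no cycle."""
    rules = set(rules)

    def unit_edges():
        return {(h, b[0]) for h, b in rules if len(b) == 1 and b[0] in variables}

    edges = unit_edges()
    while edges:
        sources = {a for a, _ in edges}
        # X: a vertex with an entering edge but no leaving edge.
        # (min raises ValueError if none exists, i.e. if the graph has a cycle.)
        x = min(b for _, b in edges if b not in sources)
        bodies = [b for h, b in rules if h == x]
        for a, target in edges:
            if target == x:
                rules.discard((a, (x,)))            # delete A -> X
                rules |= {(a, b) for b in bodies}   # add A -> alpha for each X -> alpha
        edges = unit_edges()
    return rules


variables = {"S", "C", "A", "B", "U"}
rules = {
    ("S", ("C",)),
    ("C", ("A", "C")), ("C", ("U", "B")), ("C", ("U",)), ("C", ("C", "A")),
    ("A", ("B",)), ("A", ("C",)),
    ("B", ("b",)),
    ("U", ("a",)),
}
new_rules = delete_unit_rules(rules, variables)
for rule in sorted(new_rules):
    print(rule)
```

The sketch picks the alphabetically smallest eligible vertex each time, so the order of deletions may differ from a hand run, but the resulting grammar does not depend on that order here.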

Step 5 continued

Coming back to our grammar, we can delete either vertex \(U\) or vertex \(B\), since each has an entering edge but no leaving edge. Let’s choose \(U\) as victim. We delete \(U\) from the graph and alter the grammar

\(S \to C\)

\(C \to AC \ |\ U B \ |\ U \ |\ CA\)

\(A \to B \ |\ C\)

\(B \to b\)

\(U \to a\)

to get this grammar and graph

\(S \to C\)

\(C \to AC \ |\ U B \ |\ a \ |\ CA\)

\(A \to B \ |\ C\)

\(B \to b\)

\(U \to a\)

Step 5 continued

\(S \to C\)

\(C \to AC \ |\ U B \ |\ a \ |\ CA\)

\(A \to B \ |\ C\)

\(B \to b\)

\(U \to a\)

At this point we can delete either vertex \(B\) or vertex \(C\). Let’s choose \(B\) as victim. We delete \(B\) and alter the grammar to get the lower grammar and graph.

\(S \to C\)

\(C \to AC \ |\ U B \ |\ a \ |\ CA\)

\(A \to b \ |\ C\)

\(B \to b\)

\(U \to a\)

Step 5 continued

\(S \to C\)

\(C \to AC \ |\ U B \ |\ a \ |\ CA\)

\(A \to b \ |\ C\)

\(B \to b\)

\(U \to a\)

At this point \(C\) is the only vertex left to delete. We delete it and alter the grammar to

\(S \to AC\ |\ U B\ |\ a \ | \ CA\)

\(C \to AC\ |\ U B\ |\ a \ | \ CA\)

\(A \to b\ |\ AC \ |\ U B \ |\ a \ |\ CA\)

\(B \to b\)

\(U \to a\)

The resulting graph now has no edge, and we are done! Above is the final grammar in CNF.

Chomsky Normal Form

San Skulrattanakulchai

March 21, 2019
