Top-down parsing

Top-down parsing conceptually corresponds to constructing the parse tree by expanding the internal nodes of the final parse tree in preorder, or equivalently, finding the leftmost derivation of the input string.
[Demonstrate by an example here.]
The key problem in top-down parsing is choosing the right production to apply at each step while constructing the parse tree.
A general top-down parser uses backtracking to find the correct production to apply.
A special case occurs for some grammars where the parser can choose the correct production to apply by looking at a fixed number of input symbols (usually one).
Let $k$ be a fixed positive integer. The class of grammars for which we can construct a predictive parser by looking at $k$ symbols ahead is called the $LL(k)$ class.
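As a small illustration of the correspondence described above, here is a leftmost derivation of the string id + id. It assumes the expression grammar used in the later examples of these notes ($E \to T E'$, $E' \to + T E' \mid \epsilon$, $T \to F T'$, $T' \to * F T' \mid \epsilon$, $F \to ( E ) \mid \mbox{id}$). Each step expands the leftmost nonterminal, which is the order in which a preorder traversal expands the internal nodes of the parse tree.

\[ \begin{array}{lcl}
E & \Rightarrow & T\,E' \\
  & \Rightarrow & F\,T'\,E' \\
  & \Rightarrow & \mbox{id}\;T'\,E' \\
  & \Rightarrow & \mbox{id}\;E' \\
  & \Rightarrow & \mbox{id} + T\,E' \\
  & \Rightarrow & \mbox{id} + F\,T'\,E' \\
  & \Rightarrow & \mbox{id} + \mbox{id}\;T'\,E' \\
  & \Rightarrow & \mbox{id} + \mbox{id}\;E' \\
  & \Rightarrow & \mbox{id} + \mbox{id}
\end{array} \]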

Recursive-descent parsing

For each nonterminal $A$, we write a recursive procedure A() with this outline:

A() {
    choose an A-production A → X1 X2 ... Xk;
    for i ← 1 to k do {
        if Xi is a nonterminal then
            Xi();
        else if Xi = current input symbol then
            advance input to the next symbol;
        else
            error();
    }
}

The “choose an A-production …” step must try every possible A-production in a backtracking manner. Each time the parser backtracks, it must also restore the input pointer to the position it had before the failed attempt.

Nullability

Let $\alpha$ be a nonempty string of grammar symbols. We say that $\alpha$ is nullable if $\alpha\Rightarrow^*\epsilon$. It follows that no terminal is nullable.
A nonterminal $A$ may or may not be nullable. If $R$ contains a production $A\to\epsilon$, then $A$ is obviously nullable. However, a nonterminal $B$ can be nullable even though no production $B\to\epsilon$ exists. We will give an algorithm for computing all the nullable nonterminals on the next slide.
A nonempty string of grammar symbols $\alpha = X_1X_2\cdots X_k$ is nullable if and only if $X_i$ is nullable for all $1\le i \le k$.
We also define the empty string $\epsilon$ to be nullable, since $\epsilon$ derives $\epsilon$ in zero steps.

The nullable set of nonterminals

Let $N(G)$ be the set of nullable nonterminals of the CFG $G = (V, \Sigma, R, S)$.

Here is an algorithm for computing the set $N(G)$.

Q ← empty queue;
NG ← empty set;
for all A ∈ V do {
    if A → ε is a production in R then {
        NG ← NG ∪ {A};
        enqueue(A, Q);
    }
}
while Q is not empty do {
    A ← dequeue(Q);
    mark all occurrences of A in the right-hand sides of all productions;
    for each production B → X1 X2 ... Xk in which this marking
    leaves every Xi (1 ≤ i ≤ k) marked do {
        if B ∉ NG then {
            NG ← NG ∪ {B};
            enqueue(B, Q);
        }
    }
}
return NG;
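The queue-based algorithm above might be realized in Python along the following lines. Instead of physically marking occurrences, this sketch keeps a counter per production of the symbols not yet known to be nullable; that bookkeeping, the grammar encoding, and the name `nullable_nonterminals` are assumptions made for the illustration.

```python
from collections import deque


def nullable_nonterminals(grammar):
    """Return the set of nullable nonterminals.

    `grammar` maps each nonterminal to a list of alternatives; each
    alternative is a tuple of symbols, and () stands for an ε-production.
    """
    nullable = set()
    queue = deque()
    remaining = {}    # (head, i) -> number of body symbols not yet marked
    occurrences = {}  # symbol -> one (head, i) entry per occurrence

    for head, alternatives in grammar.items():
        for i, body in enumerate(alternatives):
            remaining[(head, i)] = len(body)
            for symbol in body:
                occurrences.setdefault(symbol, []).append((head, i))
            if not body and head not in nullable:
                nullable.add(head)
                queue.append(head)

    while queue:
        symbol = queue.popleft()
        # "Marking" an occurrence means decrementing its production's counter.
        for head, i in occurrences.get(symbol, []):
            remaining[(head, i)] -= 1
            if remaining[(head, i)] == 0 and head not in nullable:
                nullable.add(head)
                queue.append(head)
    return nullable


EXPRESSION_GRAMMAR = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F":  [("(", "E", ")"), ("id",)],
}
NULLABLE = nullable_nonterminals(EXPRESSION_GRAMMAR)
```

Terminals are never enqueued, so a production whose body contains a terminal never reaches a counter of zero, in line with the observation that no terminal is nullable.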

The `FIRST` sets

Every CFG has two associated functions, FIRST and FOLLOW, that are helpful for constructing both top-down and bottom-up parsers.
Let $G = (V, \Sigma, R, S)$ be a CFG having set of nonterminals $V$, set of terminals $\Sigma$, set of productions $R$, and the start symbol $S$.
Let $\alpha$ be any string of grammar symbols. Define first$(\alpha)$ to be \[ \mbox{first}(\alpha) := \{a\in\Sigma : \alpha \Rightarrow^* a\beta \text{ for some string of grammar symbols }\beta\}. \] Note that first$(a) = \{a\}$ for all $a\in\Sigma$.
The function FIRST$(\alpha)$ is defined to be
\[ \mbox{FIRST}(\alpha) := \left\{ \begin{array}{ll} \mbox{first}(\alpha)\cup\{\epsilon\}, & \mbox{if $\alpha$ is nullable} \\ \mbox{first}(\alpha), & \mbox{otherwise.} \end{array} \right. \] Note that FIRST$(a)$ = first$(a) = \{a\}$ for all $a\in\Sigma$. Also, FIRST$(\epsilon) = \{\epsilon\}$ since first$(\epsilon) = \emptyset$ and $\epsilon$ is nullable.

Computing the `FIRST` sets

We have already noted that FIRST($a$) = first($a$) = $\{a\}$ for all $a\in\Sigma$.
We can compute the FIRST sets of all nonterminals by following these three steps:
1. Compute the set of all nullable nonterminals $N(G)$.
2. For each nonterminal $A$, we compute first($\alpha_i$) for all nonempty alternatives $\alpha_i$ ($1\le i\le n$) of the $A$-productions $A\to \alpha_1 \mid \alpha_2 \mid \cdots \mid \alpha_n$. The set first($A$) is simply $\bigcup_{i=1}^n$ first($\alpha_i$). Details of this step will be given on the next slide.
3. For each nonterminal $A$, compute FIRST($A$) from first($A$) and the nullability or non-nullability of $A$ using the definition of FIRST.
Once we know first($A$) (and thus FIRST($A$)) of all nonterminals $A$, we can compute first($\alpha$) (and thus FIRST($\alpha$)) of any nonempty string of grammar symbols $\alpha$. The algorithm is as follows:
```
Let α = X1 X2 ... Xn;
first_α ← ∅;
for i ← 1 to n do {
    first_α ← first_α ∪ first(Xi);
    if Xi is not nullable then
        return first_α;
}
return first_α;
```
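The loop above translates almost directly into Python. In this sketch the names `first_of_string` and `FIRST_of_string`, the use of the string "ε" for the empty string, and the dictionary layout are assumptions; the first sets and the nullable set are taken from the expression-grammar example in these notes.

```python
EPSILON = "ε"

# first (lowercase) sets of the nonterminals of the expression grammar,
# and its nullable nonterminals, as stated in the example of these notes.
FIRST_SETS = {
    "E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
    "E'": {"+"}, "T'": {"*"},
}
NULLABLE = {"E'", "T'"}


def first_of_string(symbols, first=FIRST_SETS, nullable=NULLABLE):
    """Return first(X1 X2 ... Xn): terminals that can begin a derived string."""
    result = set()
    for x in symbols:
        # A terminal a has first(a) = {a}; nonterminals are looked up.
        result |= first.get(x, {x})
        if x not in nullable:
            return result
    return result


def FIRST_of_string(symbols, first=FIRST_SETS, nullable=NULLABLE):
    """Return FIRST(X1 ... Xn), which adds ε when the whole string is nullable."""
    result = first_of_string(symbols, first, nullable)
    if all(x in nullable for x in symbols):
        result = result | {EPSILON}
    return result
```

For example, `first_of_string(("T'", "E'", ")"))` scans past both nullable nonterminals and stops at the terminal `)`.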

The `first` sets of all nonterminals

Here is my algorithm for computing first($A$) for all nonterminals $A$.

for each nonterminal A ∈ V do
    initialize first(A) to be empty set;
Let H be a digraph with vertex set V and an empty edge set;
for each nonterminal A ∈ V do {
    for each non-ε-production of A, say, A → X1 X2 ... Xn, do {
        for i ← 1 to n do {
            if Xi is a terminal then
                first(A) ← first(A) ∪ { Xi };
            else
                add a directed edge (Xi,A) to H;
            if Xi is not nullable then break;
        }
    }
}
/* This step contracts all strong components of H. */
while H contains a directed cycle C do {
    contract C to a supervertex and set the first set of
    this supervertex to be equal to the union of first(v)'s
    for all vertices v on C;
}
/* At this point H is a dag! */
topologically sort H;
for each vertex v in H in topological order do {
    for each edge (v,w) do {
        first(w) ← first(w) ∪ first(v);
    }
}

Example computation of the `FIRST` sets

Given the grammar

E  →  T E'
E' →  + T E' | ε
T  →  F T'
T' →  * F T' | ε
F  →  ( E ) | id

run the algorithm to verify that

FIRST(T) = FIRST(F) = FIRST(E) = { (, id }
FIRST(E') = { +, ε }
FIRST(T') = { *, ε }
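One way to check these sets mechanically is the short Python sketch below. Note that it does not implement the digraph-contraction algorithm given earlier: it swaps in a plain fixed-point iteration, which repeats the same set inclusions until nothing changes. The grammar encoding and the names `compute_nullable` and `compute_first` are assumptions made for this illustration.

```python
EPSILON = "ε"

GRAMMAR = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F":  [("(", "E", ")"), ("id",)],
}


def compute_nullable(grammar):
    """Return the nullable nonterminals by iterating to a fixed point."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, alternatives in grammar.items():
            if head in nullable:
                continue
            if any(all(x in nullable for x in body) for body in alternatives):
                nullable.add(head)
                changed = True
    return nullable


def compute_first(grammar):
    """Return FIRST(A) for every nonterminal A (ε included when A is nullable)."""
    nullable = compute_nullable(grammar)
    first = {head: set() for head in grammar}
    changed = True
    while changed:
        changed = False
        for head, alternatives in grammar.items():
            for body in alternatives:
                for x in body:
                    new = first[x] if x in grammar else {x}
                    if not new <= first[head]:
                        first[head] |= new
                        changed = True
                    if x not in nullable:
                        break  # later symbols cannot contribute
    return {
        head: first[head] | ({EPSILON} if head in nullable else set())
        for head in grammar
    }


FIRST = compute_first(GRAMMAR)
```

The fixed-point loop may revisit productions more often than the topological-order propagation would, but it computes the same sets.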

The `FOLLOW` sets

Let $A$ be a nonterminal. The function follow(A) is defined to be \[ \mbox{follow}(A) := \{ a\in\Sigma : S\Rightarrow^* \alpha Aa\beta \mbox{ for some strings of grammar symbols } \alpha, \beta \}. \]
Let \$ be a special symbol that is not a grammar symbol. We will use \$ as the end-of-input and bottom-of-stack marker.
The function FOLLOW($A$) is defined to be \[ \mbox{FOLLOW}(A) := \left\{ \begin{array}{ll} \mbox{follow}(A)\cup\{\$\}, & \mbox{if $S\Rightarrow^* \alpha A$ for some string of grammar symbols $\alpha$} \\ \mbox{follow}(A), & \mbox{otherwise.} \end{array} \right. \] Note that $\$\in$ FOLLOW($S$) always.
Remark: We can also define the FOLLOW sets for terminals using exactly the same definition. We focus on the FOLLOW sets of nonterminals only because those are the ones needed for parsers.

The `FOLLOW` sets of all nonterminals

Here is my algorithm for computing FOLLOW($A$) for all nonterminals $A$.

for each nonterminal A ∈ V do
    initialize FOLLOW(A) to be empty set;
FOLLOW(S) ← { $ };
Let H be a digraph with vertex set V and an empty edge set;
for each nonterminal A ∈ V do {
    for each non-ε-production of A, say, A → X1 X2 ... Xn, do {
        for i ← n downto 1 do {
            if Xi is a nonterminal then {
                Let β be the string X{i+1} X{i+2} ... Xn;
                FOLLOW(Xi) ← FOLLOW(Xi) ∪ first(β);
                if β is nullable and A ≠ Xi then
                    add a directed edge (A, Xi) to H;
            }
        }
    }
}
/* This step contracts all strong components of H. */
while H contains a directed cycle C do {
    contract C to a supervertex and set the FOLLOW set of
    this supervertex to be equal to the union of FOLLOW(v)'s
    for all vertices v on C;
}
/* At this point H is a dag! */
topologically sort H;
for each vertex v in H in topological order do {
    for each edge (v,w) do {
        FOLLOW(w) ← FOLLOW(w) ∪ FOLLOW(v);
    }
}

Example computation of the `FOLLOW` sets

Given the grammar

E  →  T E'
E' →  + T E' | ε
T  →  F T'
T' →  * F T' | ε
F  →  ( E ) | id

run the algorithm to verify that

FOLLOW(E) = FOLLOW(E') = { ), $ }
FOLLOW(T) = FOLLOW(T') = { +, ), $ }
FOLLOW(F) = { +, *, ), $ }
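These FOLLOW sets can be checked with the Python sketch below. As a simplification, it uses a fixed-point iteration in place of the digraph-contraction algorithm given earlier; both apply the same two rules (add first(β), and add FOLLOW(A) when β is nullable). The first sets and nullable set are taken from the FIRST example in these notes; the encoding and function names are assumptions.

```python
GRAMMAR = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F":  [("(", "E", ")"), ("id",)],
}
# first (lowercase) sets and nullable nonterminals from the FIRST example.
FIRST_TERMINALS = {
    "E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
    "E'": {"+"}, "T'": {"*"},
}
NULLABLE = {"E'", "T'"}
END = "$"


def first_of_string(symbols):
    """Return first(β) as a fresh set; a terminal a has first(a) = {a}."""
    result = set()
    for x in symbols:
        result |= FIRST_TERMINALS.get(x, {x})
        if x not in NULLABLE:
            break
    return result


def compute_follow(grammar, start):
    """Return FOLLOW(A) for every nonterminal A of `grammar`."""
    follow = {head: set() for head in grammar}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for head, alternatives in grammar.items():
            for body in alternatives:
                for i, x in enumerate(body):
                    if x not in grammar:
                        continue  # only nonterminals get FOLLOW sets here
                    beta = body[i + 1:]
                    new = first_of_string(beta)
                    if all(y in NULLABLE for y in beta):
                        new |= follow[head]
                    if not new <= follow[x]:
                        follow[x] |= new
                        changed = True
    return follow


FOLLOW = compute_follow(GRAMMAR, "E")
```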

LL(1) grammars

A grammar is in LL(1) if for all nonterminals $A$ and for all $A$-productions $A\to\alpha_1 \mid \alpha_2 \mid \cdots \mid \alpha_n$, we have
- ${\rm FIRST}(\alpha_i)\cap {\rm FIRST}(\alpha_j)=\emptyset$ for all $i\ne j$, and
- if $\alpha_i\Rightarrow^*\epsilon$ for some $i$, then ${\rm FIRST}(\alpha_j)\cap {\rm FOLLOW}(A)=\emptyset$ for all $j\ne i$.
An LL(1) grammar can be parsed using a predictive parser.
Implementing a predictive parser is easier if we collect all the information from the FIRST and FOLLOW sets into a 2D array $M[\cdot,\cdot]$ such that for any $A\in V$ and any $a\in\Sigma\cup\{\$\}$, the entry $M[A,a]$ gives the production to apply when the parser expands a node $A$ while the lookahead symbol is $a$.

Constructing the $M[\cdot,\cdot]$ table

Algorithm:

initialize every entry M[A,a] to the empty set;
for each production A → α do {
    for each terminal a in first(α) do
        M[A,a] ←  M[A,a] ∪ { A → α };
    if α ⇒* ε then {
        for all a ∈ FOLLOW(A) do
            M[A,a] ←  M[A,a] ∪ { A → α };
    }
}

If, while executing the above algorithm, we discover an entry $M[A,a]$ with more than one production, we can abort the algorithm, since we have found a proof that the grammar is not in LL(1).
Any table entry $M[A,a]$ that remains empty after the algorithm finishes signifies ERROR for the parser, i.e., if the parser using this table is trying to predict which production to apply at node $A$ while the lookahead symbol is $a$, the parser knows that the input is invalid, so it can abort.
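A minimal Python sketch of the table construction follows. To keep the block self-contained, it hard-codes the first, nullable, and FOLLOW sets stated in the examples of these notes for the expression grammar; the dictionary-based table (missing keys play the role of empty entries) and all function names are assumptions.

```python
GRAMMAR = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F":  [("(", "E", ")"), ("id",)],
}
FIRST_TERMINALS = {
    "E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
    "E'": {"+"}, "T'": {"*"},
}
NULLABLE = {"E'", "T'"}
FOLLOW = {
    "E": {")", "$"}, "E'": {")", "$"},
    "T": {"+", ")", "$"}, "T'": {"+", ")", "$"},
    "F": {"+", "*", ")", "$"},
}


def first_of_string(symbols):
    """Return first(α) as a fresh set; a terminal a has first(a) = {a}."""
    result = set()
    for x in symbols:
        result |= FIRST_TERMINALS.get(x, {x})
        if x not in NULLABLE:
            break
    return result


def build_table(grammar):
    """Map (A, a) to the list of productions A → α placed in M[A, a]."""
    table = {}
    for head, alternatives in grammar.items():
        for body in alternatives:
            lookaheads = first_of_string(body)
            if all(x in NULLABLE for x in body):  # α derives ε
                lookaheads |= FOLLOW[head]
            for a in lookaheads:
                table.setdefault((head, a), []).append((head, body))
    return table


def has_no_conflicts(table):
    """True when no entry holds more than one production."""
    return all(len(productions) == 1 for productions in table.values())


TABLE = build_table(GRAMMAR)
```

A key that is absent from `TABLE` corresponds to an empty entry, that is, to ERROR.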

An example LL(1) grammar

Verify that this grammar

E  →  T E'
E' →  + T E' | ε
T  →  F T'
T' →  * F T' | ε
F  →  ( E ) | id

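is in LL(1) and that it has this $M$ table (entries not listed are empty, i.e., ERROR):

M[E, id]  = M[E, (]  = E → T E'
M[E', +]  = E' → + T E'
M[E', )]  = M[E', $] = E' → ε
M[T, id]  = M[T, (]  = T → F T'
M[T', *]  = T' → * F T'
M[T', +]  = M[T', )] = M[T', $] = T' → ε
M[F, (]   = F → ( E )
M[F, id]  = F → id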

An example non-LL(1) grammar

Verify that this grammar
```
S  →  i E t S S' | a
S' →  e S | ε
E  →  b
```
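is not in LL(1): its $M$ table has two productions in the entry for $S'$ and $e$ (entries not listed are empty):

M[S, i]  = S → i E t S S'
M[S, a]  = S → a
M[S', e] = { S' → e S, S' → ε }
M[S', $] = S' → ε
M[E, b]  = E → b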
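The conflict in this grammar can be found mechanically. The sketch below computes nullability, first sets, and FOLLOW sets together by fixed-point iteration (a simplification, not the digraph-based algorithms of these notes), builds the table, and collects the entries that hold more than one production. The encoding and the names `analyse` and `build_table` are assumptions made for the illustration.

```python
GRAMMAR = {
    "S":  [("i", "E", "t", "S", "S'"), ("a",)],
    "S'": [("e", "S"), ()],
    "E":  [("b",)],
}


def analyse(grammar, start):
    """Return helper functions and FOLLOW sets computed to a fixed point."""
    nullable = set()
    first = {head: set() for head in grammar}
    follow = {head: set() for head in grammar}
    follow[start].add("$")

    def first_of(symbols):
        out = set()
        for x in symbols:
            out |= first[x] if x in grammar else {x}
            if x not in nullable:
                break
        return out

    def is_nullable(symbols):
        return all(x in nullable for x in symbols)

    def size():
        return (len(nullable), sum(map(len, first.values())),
                sum(map(len, follow.values())))

    changed = True
    while changed:
        before = size()
        for head, alternatives in grammar.items():
            for body in alternatives:
                if is_nullable(body):
                    nullable.add(head)
                first[head] |= first_of(body)
                for i, x in enumerate(body):
                    if x in grammar:
                        beta = body[i + 1:]
                        follow[x] |= first_of(beta)
                        if is_nullable(beta):
                            follow[x] |= follow[head]
        changed = size() != before
    return first_of, is_nullable, follow


def build_table(grammar, start):
    """Map (A, a) to the list of right-hand sides placed in M[A, a]."""
    first_of, is_nullable, follow = analyse(grammar, start)
    table = {}
    for head, alternatives in grammar.items():
        for body in alternatives:
            lookaheads = first_of(body)
            if is_nullable(body):
                lookaheads |= follow[head]
            for a in lookaheads:
                table.setdefault((head, a), []).append(body)
    return table


TABLE = build_table(GRAMMAR, "S")
CONFLICTS = {key: rhs for key, rhs in TABLE.items() if len(rhs) > 1}
```

The only conflicting entry is the one for S' on lookahead e, where the parser cannot tell from one symbol whether to attach the e-branch to the current statement or to derive ε.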

Non-recursive predictive parsing

By using a stack to keep track of the parse tree nodes yet to be visited, together with the $M$ table, a predictive parser can be implemented in a non-recursive manner.
The algorithm below assumes the following.
- The $M$ table has already been computed.
- The stack holds exactly two symbols: the bottom-of-stack marker \$ at the bottom and the start symbol $S$ on top.
- The input string is $w\$$, where $w$ is the real input string and \$ is the end-of-input marker.

Algorithm:

a ← first input symbol;
X ← top stack symbol;
while X ≠ $ do {
    if X = a then {
        pop the stack;
        a ← next input symbol;
    } else if X is a terminal then error();
    else if M[X,a] is empty then error();
    else {
        let M[X,a] be X → Y1 Y2 ... Yk;
        output the production M[X,a];
        pop the stack;
        for i ← k downto 1 do
            push Yi on the stack;
    }
    X ← top stack symbol;
}
if a ≠ $ then error();  /* the stack is exhausted but input remains */
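The stack-driven loop above can be sketched in Python as follows. The table literal is an assumption: it was obtained by applying the table-construction algorithm of these notes to the expression grammar, and it maps (nonterminal, lookahead) to a right-hand side. The names `ParseError` and `predictive_parse` are hypothetical, and the sketch makes the final check for leftover input explicit.

```python
class ParseError(Exception):
    """Raised when the input is not in the language of the grammar."""


# M table for the expression grammar; missing keys are ERROR entries.
TABLE = {
    ("E", "id"): ("T", "E'"), ("E", "("): ("T", "E'"),
    ("E'", "+"): ("+", "T", "E'"), ("E'", ")"): (), ("E'", "$"): (),
    ("T", "id"): ("F", "T'"), ("T", "("): ("F", "T'"),
    ("T'", "*"): ("*", "F", "T'"),
    ("T'", "+"): (), ("T'", ")"): (), ("T'", "$"): (),
    ("F", "id"): ("id",), ("F", "("): ("(", "E", ")"),
}
NONTERMINALS = {head for head, _ in TABLE}


def predictive_parse(tokens, start="E"):
    """Return the productions applied, in order, as (head, body) pairs."""
    tokens = list(tokens) + ["$"]   # append the end-of-input marker
    position = 0
    stack = ["$", start]            # bottom marker, then the start symbol
    output = []
    while stack[-1] != "$":
        top, lookahead = stack[-1], tokens[position]
        if top == lookahead:
            stack.pop()             # matched a terminal
            position += 1
        elif top not in NONTERMINALS:
            raise ParseError(f"expected {top!r}, found {lookahead!r}")
        elif (top, lookahead) not in TABLE:
            raise ParseError(f"no rule for {top!r} on {lookahead!r}")
        else:
            body = TABLE[(top, lookahead)]
            output.append((top, body))
            stack.pop()
            stack.extend(reversed(body))  # leftmost symbol ends up on top
    if tokens[position] != "$":
        raise ParseError(f"unexpected {tokens[position]!r} after end of parse")
    return output


PRODUCTIONS = predictive_parse(["id", "+", "id", "*", "id"])
```

Calling `predictive_parse(["id", "+", "id", "*", "id"])` returns the productions of a leftmost derivation of the input, in the order in which the parser applies them.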

Sample run

Execution of the non-recursive predictive parsing algorithm for the grammar
```
E  →  T E'
E' →  + T E' | ε
T  →  F T'
T' →  * F T' | ε
F  →  ( E ) | id
```
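on the input string id + id * id outputs the following productions, in this order; together they form a leftmost derivation of the input:

E  →  T E'
T  →  F T'
F  →  id
T' →  ε
E' →  + T E'
T  →  F T'
F  →  id
T' →  * F T'
F  →  id
T' →  ε
E' →  ε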

Top-down Parsing

San Skulrattanakulchai

February 26, 2019
