# Examples¶

To give a thorough example of what the library allows to implement, in this Jupyter session we follow rather faitfully some Sections of Parsing Techniques, an eccellent book on parsing by Dick Grune and Ceriel J.H. Jacobs. We can omit many details (such as defintions, proofs, careful algorithm explanations…) but the interested reader can still find in such text a very precise, clear and conceputal description of the ideas behind the code implemented in the following examples.

## Type 1 grammars and production graphs¶

Let’s start defining a monotonic grammar for $$a^nb^nc^n$$

from liblet import Grammar

G = Grammar.from_string("""
S -> a b c
S -> a S Q
b Q c -> b b c c
c Q -> Q c
""", False)

G

Grammar(N={Q, S}, T={a, b, c}, P=(S -> a b c, S -> a S Q, b Q c -> b b c c, c Q -> Q c), S=S)


Productions are conveniently represented as a table, with numbered alternatives.

G.P

S a b c(0) | a S Q(1) b b c c(2) Q c(3)

It’s now time to create a derivation of $$a^2b^2c^2$$

from liblet import Derivation

d = Derivation(G).step(1, 0).step(0, 1).step(3, 3).step(2, 2)
d

S -> a S Q -> a a b c Q -> a a b Q c -> a a b b c c


It can be quite illuminating to see the production graph for such derivation

from liblet import ProductionGraph

ProductionGraph(d) ## Context-free grammars and ambiguity¶

Assume we want to experiment with an ambiguous grammar and look for two different leftmost derivation of the same sentence.

To this aim, let’s consider the following grammar and a short derivation leading to and addition of three terminals

G = Grammar.from_string("""
E -> E + E
E -> E * E
E -> i
""")

d = Derivation(G).step(0, 0).step(0, 0)
d

E -> E + E -> E + E + E


What are the possible steps at this point? The possible_steps method comes in handy, here is a (numbered) table of pairs $$(p, q)$$ where $$p$$ is production number and $$q$$ the position of the nonterminal that can be substituted:

possible_steps = list(d.possible_steps())

from liblet import iter2table

iter2table(possible_steps)

0 (0, 0) (0, 2) (0, 4) (1, 0) (1, 2) (1, 4) (2, 0) (2, 2) (2, 4)

If we look for just for leftmost derivations among the $$(p, q)$$s, we must keep just the $$p$$s corresponding to the $$q$$s equal to the minimum of the possible $$q$$ values. The following function can be used to such aim:

from operator import itemgetter

def filter_leftmost_prods(possible_steps):
possible_steps = list(possible_steps)
if possible_steps:
min_q = min(possible_steps, key = itemgetter(1))
return map(itemgetter(0), filter(lambda ps: ps == min_q, possible_steps))
return tuple()

list(filter_leftmost_prods(possible_steps))

[0, 1, 2]


Now, using a Queue we can enumerate all the leftmost productions, we can have a fancy generator that returns a new derivation each time next is called on it:

from liblet import Queue

def derivation_generator(G):
Q = Queue([Derivation(G)])
while Q:
derivation = Q.dequeue()
if set(derivation.sentential_form()) <= G.T:
yield derivation
for nprod in filter_leftmost_prods(derivation.possible_steps()):
Q.enqueue(derivation.leftmost(nprod))


Let’s collect the first 10 derivations

derivation = derivation_generator(G)
D = [next(derivation) for _ in range(10)]
iter2table(D)

0 E -> i E -> E + E -> i + E -> i + i E -> E * E -> i * E -> i * i E -> E + E -> E + E + E -> i + E + E -> i + i + E -> i + i + i E -> E + E -> E * E + E -> i * E + E -> i * i + E -> i * i + i E -> E + E -> i + E -> i + E + E -> i + i + E -> i + i + i E -> E + E -> i + E -> i + E * E -> i + i * E -> i + i * i E -> E * E -> E + E * E -> i + E * E -> i + i * E -> i + i * i E -> E * E -> E * E * E -> i * E * E -> i * i * E -> i * i * i E -> E * E -> i * E -> i * E + E -> i * i + E -> i * i + i

As one can easily see, derivations 6 and 7 produce the same sentence i + i * i but evidently with two different leftmost derivations. We can give a look at the production graphs to better see what is happening.

from liblet import side_by_side

side_by_side(ProductionGraph(D), ProductionGraph(D))


## Hygiene in Context-Free Grammars¶

First of all, let’s start with a series of techniques to clean a context-free grammar by removing unreachable, non-productive, and undefined symbols. Let’s start with the context-free grammar $$G$$ of Figure 2.25 at page 49 of Parsing Techniques, in particular we’ll be following the flow of Sections 2.9.1, 2.9.2 and 2.9.5.

G = Grammar.from_string("""
S -> A B | D E
A -> a
B -> b C
C -> c
D -> d F
E -> e
F -> f D
""")


We can use the @closure decorator to obtain the productive symbols by extending at every round the set prod of productive symbols as {A for A, α in G.P if set(α) <= prod}, that is taking all the left-hand sides of productions whose left-hand sides are in turn made of productive symbols.

from liblet import closure

def find_productive(G):
@closure
def find(prod):
return prod | {A for A, α in G.P if set(α) <= prod}
return set(find(G.T))

find_productive(G)

{'A', 'B', 'C', 'E', 'S', 'a', 'b', 'c', 'd', 'e', 'f'}


Similarly, we can obtain the reachable symbols by extending at every round the set reach of reachable symbols as union_of(set(α) for A, α in G.P if A in reach)}, that is taking the union all the left-hand sides of productions whose left-hand sides are in turn reachable.

from liblet import union_of

def find_reachable(G):
@closure
def find(reach, G):
return reach | union_of(set(α) for A, α in G.P if A in reach)
return find({G.S}, G)

find_reachable(G)

{'A', 'B', 'C', 'D', 'E', 'F', 'S', 'a', 'b', 'c', 'd', 'e', 'f'}


To clean the grammar one has first to eliminate the non-productive symbols and the the non-reachable onse (as acting in the reverse order can leave around non-reachable symbols after the first removal).

def remove_unproductive_unreachable(G):
Gp = G.restrict_to(find_productive(G))
return Gp.restrict_to(find_reachable(Gp))

remove_unproductive_unreachable(G)

Grammar(N={A, B, C, S}, T={a, b, c}, P=(S -> A B, A -> a, B -> b C, C -> c), S=S)


To remove undefined nonterminals is easy, it’s enough to collect the ones appearing as left-hand side in some production and throw away the others

def remove_undefined(G):
return G.restrict_to({A for A, α in G.P} | G.T)


Given that Grammar.from_string considers nonterminal just the symbols on the left-hand sides, to check that the last method works we need to build a grammar in another way:

from liblet import Production

Gu = Grammar({'S', 'T'}, {'s'}, (Production('S', ('s',)),), 'S')
Gu

Grammar(N={S, T}, T={s}, P=(S -> s,), S=S)

remove_undefined(Gu)

Grammar(N={S}, T={s}, P=(S -> s,), S=S)


Observe that undefined symbols are non-productive, hence remove_unproductive_unreachable will take implicitly care of them.

## The Chomsky Normal Form¶

Now that the grammar contains only defined, productive and reachable symbols, to get to the CHomsky normal form we need to take care of ε-rules and unit rules (following Section 4.2.3 of Parsing Techniques).

### Elimination of ε-rules¶

The elimination of ε-rules is performed in a series of consecutive steps, adding new nonterminals and productions.

As an example grammar we use the one of Figure 4.10 at page 120.

G = Grammar.from_string("""
S -> L a M
L -> L M
L -> ε
M -> M M
M -> ε
""")


Given a rule $$A\to ε$$ we look for rules of the form $$B\to αAβ$$ and “inline” the ε-rule by adding two new rules $$B\to αA'β$$ and $$B\to αβ$$ where $$A'$$ is a new nonterminal; this of course need to be iterated (in a closure) to cope with productions where $$A$$ appears more than once in the left-hand side.

@closure
def replace_in_rhs(G, A):
Ap = A + '′'
prods = set()
for B, β in G.P:
if A in β:
pos = β.index(A)
rhs = β[:pos] + β[pos + 1:]
if len(rhs) == 0: rhs = ('ε', )
prods.add(Production(B, β[:pos] + (Ap, ) + β[pos + 1:]))
else:
return Grammar(G.N | {Ap}, G.T, prods, G.S)

Gp = replace_in_rhs(G, 'M')
Gp.P

L L(0) | ε(5) | L M′(6) M′(1) | ε(2) | M′ M′(7) L a M′(3) | L a(4)

The above procedure must be repeated for evey ε-rule, moreover since the process can intruduce new ε-rules, a closure is again needed.

@closure
def inline_ε_rules(G_seen):
G, seen = G_seen
for A in G.N - seen:
if ('ε', ) in G.alternatives(A):
return replace_in_rhs(G, A), seen | {A}
return G, seen

Gp, _ = inline_ε_rules((G, set()))
Gp.P

M M′(0) | ε(3) | M′ M′(10) L′ a M′(1) | a M′(5) | a(6) | L′ a(7) L′(2) | M′(4) | ε(8) | L′ M′(9)

The left-hand sides of the ε rules now are unreachable, but the new “primed” nonterminals must now be defined, using the non-empty left-hand sides of the one they inlined.

def eliminate_ε_rules(G):
Gp, _ = inline_ε_rules((G, set()))
prods = set(Gp.P)
for Ap in Gp.N - G.N:
A = Ap[:-1]
for α in set(Gp.alternatives(A)) - {('ε', )}:
return Grammar(Gp.N, Gp.T, prods, Gp.S)

eliminate_ε_rules(G).P

L′ L′ M′(0) | M′(13) | L′(14) M′(1) | ε(4) | M′ M′(15) L′ a M′(2) | a M′(6) | a(7) | L′ a(8) L′(3) | M′(5) | ε(9) | L′ M′(10) M′ M′(11) | M′(12)

Removing the unreachable and non-productive rules leads to quite a drastic simplification!

remove_unproductive_unreachable(eliminate_ε_rules(G))

Grammar(N={S}, T={a}, P=(S -> a,), S=S)


### Elimination of unit rules¶

To see what happens dealing with rules of the form $$A\to B$$ we’ll refer to a more complex grammar, the one of Figure 4.6 at page 112.

G = Grammar.from_string("""
Number -> Integer | Real
Integer -> Digit | Integer Digit
Real -> Integer Fraction Scale
Fraction -> . Integer
Scale -> e Sign Integer | Empty
Digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Sign -> + | -
Empty -> ε
""")
G.P

Number Integer(0) | Real(1) Digit(2) | Integer Digit(3) Integer Fraction Scale(4) . Integer(5) e Sign Integer(6) | Empty(7) 0(8) | 1(9) | 2(10) | 3(11) | 4(12) | 5(13) | 6(14) | 7(15) | 8(16) | 9(17) +(18) | -(19) ε(20)

We start by applying all the cleaning steps seen so far.

Gorig = G
G = remove_unproductive_unreachable(eliminate_ε_rules(G))
G.P

Digit 9(0) | 0(2) | 3(3) | 7(7) | 4(8) | 8(9) | 6(11) | 1(13) | 2(18) | 5(19) e Sign Integer(1) Digit(4) | Integer Digit(10) -(5) | +(16) Integer Fraction Scale′(6) | Integer Fraction(17) Integer(12) | Real(14) . Integer(15)

The elimination of the unit rules is based again on a closure that replaces $$A\to B$$ and $$B\to α$$ with $$A\to α$$.

def eliminate_unit_rules(G):
@closure
def clean(G_seen):
G, seen = G_seen
for P in set(filter(Production.such_that(rhs_len = 1), G.P)) - seen:
A, (B, ) = P
if B in G.N:
prods = (set(G.P) | {Production(A, α) for α in G.alternatives(B)}) - {P}
return Grammar(G.N, G.T, prods, G.S), seen | {P}
return G, seen
return clean((G, set()))

G = eliminate_unit_rules(G)
G.P

Number Integer Digit(0) | 7(1) | Integer Fraction(3) | 4(6) | 8(7) | 6(9) | 2(13) | 1(14) | 5(16) | 9(18) | 0(24) | Integer Fraction Scale′(30) | 3(33) 9(2) | 0(12) | 3(17) | 2(19) | 7(23) | 4(25) | 8(27) | 6(29) | 1(38) | 5(39) 2(4) | 1(5) | 5(8) | 9(11) | 0(15) | 3(20) | Integer Digit(26) | 7(28) | 6(35) | 4(36) | 8(37) e Sign Integer(10) -(21) | +(32) Integer Fraction Scale′(22) | Integer Fraction(34) . Integer(31)

### The normal form¶

Two last cases need to be taken care of to get to the CNF.

First we want to eliminate non-solitary terminals in left-hand sides, that is if $$A\to αaβ$$ where $$a\in T$$ and $$α, β\in N^*$$; this is easily solved by introducing a new nonterminal $$N_a$$ and a new rule $$N_a\to a$$, replacing the offending $$A\to αaβ$$ with $$A\to αN_aβ$$.

def transform_nonsolitary(G):
prods = set()
for A, α in G.P:
prods.add(Production(A, [f'N{x}' if x in G.T else x for x in α] if len(α) > 1 else α))
prods |= {Production(f'N{x}', (x, )) for x in α if x in G.T and len(α) > 1}
return Grammar(G.N | {A for A, α in prods}, G.T, prods, G.S)

G = transform_nonsolitary(G)
G.P

Number Integer Digit(0) | Integer Fraction(1) | 7(2) | 4(7) | 8(9) | 6(11) | 2(14) | 1(15) | 5(17) | 9(19) | 0(25) | Integer Fraction Scale′(32) | 3(35) 9(3) | 0(13) | 3(18) | 7(24) | 4(26) | 8(29) | 6(31) | 1(33) | 2(40) | 5(41) e(4) 2(5) | 1(6) | 5(10) | 9(12) | 0(16) | 3(21) | Integer Digit(27) | 7(30) | 6(37) | 4(38) | 8(39) N. Integer(8) .(20) -(22) | +(34) Integer Fraction Scale′(23) | Integer Fraction(36) Ne Sign Integer(28)

Finally we need to shorten left-hand sides longer than 2 symbols. Again that is easily accomplished by introducing new nonterminals and rules.

def make_binary(G):
prods = set()
for A, α in G.P:
if len(α) > 2:
Ai = '{}{}'.format(A, 1)
for i, Xi in enumerate(α[2:-1], 2):
prods.add(Production('{}{}'.format(A, i), (Ai, Xi)))
Ai = '{}{}'.format(A, i)
else:
return Grammar(G.N | {A for A, α in prods}, G.T, prods, G.S)

G = make_binary(G)
G.P

Number Integer Digit(0) | 7(1) | Integer Fraction(2) | 4(8) | 8(10) | 6(12) | 2(16) | 1(17) | 5(19) | 9(23) | 0(29) | Number1 Scale′(35) | 3(37) 9(3) | 0(14) | 3(22) | 2(25) | 7(28) | 4(30) | 8(32) | 6(34) | 1(43) | 5(44) e(4) Ne Sign(5) 2(6) | 1(7) | 5(11) | 9(13) | 0(18) | 3(26) | Integer Digit(31) | 7(33) | 6(40) | 4(41) | 8(42) N. Integer(9) Integer Fraction(15) Real1 Scale′(20) | Integer Fraction(39) Integer Fraction(21) .(24) -(27) | +(36) Scale′1 Integer(38)

## The Cocke, Younger, and Kasami algorithm¶

Following the CYK description given in Section 4.2.2 of Parsing Techniques we implement the algoritm by means of a dictionary R that, for the key $$(i, l)$$, records the left-hand sides of productions deriving $$s_{il}$$ that is the substring of the input starting at $$i$$ and having length $$l$$.

from liblet import CYKTable

def cyk(G, INPUT):
def fill(R, i, l):
res = set()
if l == 1:
for A, (a,) in filter(Production.such_that(rhs_len = 1), G.P):
if a == INPUT[i - 1]:
else:
for k in range(1, l):
for A, (B, C) in filter(Production.such_that(rhs_len = 2), G.P):
if B in R[i, k] and C in R[i + k, l - k]:
return res
R = CYKTable()
for l in range(1, len(INPUT) + 1):
for i in range(1, len(INPUT) - l + 2):
R[(i, l)] = fill(R, i, l)
return R

INPUT = tuple('32.5e+1') # remember: words are sequences of strings!
R = cyk(G, INPUT)
R

 Number Real   Number Real       Number Number1 Real Real1         Number Number1 Real Real1     Scale′ Integer Number   Fraction   Scale′1   Digit Integer Number Digit Integer Number N. Digit Integer Number Ne Sign Digit Integer Number

### Getting the derivation from the table¶

Once the table is filled, it’s easy to get a leftmost production by recursing in the table following the same logic used to fill it.

from liblet import show_calls

def get_leftmost_prods(G, R, INPUT):
@show_calls(True)
def prods(X, i, l):
if l == 1:
return [G.P.index(Production(X, (INPUT[i - 1],)))]
for A, (B, C) in filter(Production.such_that(lhs = X, rhs_len = 2), G.P):
for k in range(1, l):
if B in R[i, k] and C in R[i + k, l - k]:
return [G.P.index(Production(A, (B, C)))] + prods(B, i, k) + prods(C, i + k, l - k)
return prods(G.S, 1, len(INPUT))

prods = get_leftmost_prods(G, R, INPUT)

┌prods('Number', 1, 7)
│┌prods('Number1', 1, 4)
││┌prods('Integer', 1, 2)
│││┌prods('Integer', 1, 1)
│││└─ 
│││┌prods('Digit', 2, 1)
│││└─ 
││└─ [31, 26, 25]
││┌prods('Fraction', 3, 2)
│││┌prods('N.', 3, 1)
│││└─ 
│││┌prods('Integer', 4, 1)
│││└─ 
││└─ [9, 24, 11]
│└─ [15, 31, 26, 25, 9, 24, 11]
│┌prods('Scale′', 5, 3)
││┌prods('Scale′1', 5, 2)
│││┌prods('Ne', 5, 1)
│││└─ 
│││┌prods('Sign', 6, 1)
│││└─ 
││└─ [5, 4, 36]
││┌prods('Integer', 7, 1)
││└─ 
│└─ [38, 5, 4, 36, 7]
└─ [35, 15, 31, 26, 25, 9, 24, 11, 38, 5, 4, 36, 7]

d = Derivation(G)
for step in prods: d = d.leftmost(step)
ProductionGraph(d) ### Undoing the grammar transformation¶

Following section 4.2.6 of Parsing Techniques, one can undo the CNF transformation keeping track in R of symbols that became useless after the the elimination of ε-rules and unit rules, that is we clean the original grammar but avoid the remove_unproductive_unreachable step.

Gp = eliminate_unit_rules(eliminate_ε_rules(Gorig))
Gp = transform_nonsolitary(make_binary(Gp))
Gp.P

Empty ε(0) Integer Digit(1) | Integer Fraction(2) | 7(3) | 4(9) | 8(11) | 6(13) | 2(18) | 1(20) | 5(21) | 9(25) | 0(32) | Number1 Scale′(38) | 3(40) 9(4) | 0(16) | 3(24) | 2(28) | 7(31) | 4(33) | 8(35) | 6(37) | 1(47) | 5(48) e(5) Ne Sign(6) 1(7) | 2(8) | 5(12) | 9(14) | 0(19) | 3(29) | Integer Digit(34) | 7(36) | 6(43) | 4(44) | 8(46) N. Integer(10) Scale1 Integer(15) | ε(27) Integer Fraction(17) Real1 Scale′(22) | Integer Fraction(42) Integer Fraction(23) .(26) -(30) | +(39) Scale′1 Integer(41) Ne Sign(45)

We again perform the parsing, this time saving the results in Roirg table, to which add the end we add a last line with the ε-rules Rε.

Rorig = cyk(Gp, INPUT)

Rε = {A for A in Gp.N if ('ε', ) in Gp.alternatives(A)}
for i in range(1, len(INPUT) + 2): Rorig[i, 0] = Rε

Rorig

 Number Real   Number Real       Number Number1 Real Real1         Number Number1 Real Real1     Scale Scale′ Integer Number   Fraction   Scale1 Scale′1   Digit Integer Number Digit Integer Number N. Digit Integer Number Ne Sign Digit Integer Number Empty Scale Empty Scale Empty Scale Empty Scale Empty Scale Empty Scale Empty Scale Empty Scale

To recover the parse tree, we need a recursive function derives(ω, i, l) (depending on the grammar and the parse table) that for a given substring $$ω\in (T\cup N)^*$$ returns a lst if $$ω$$ derives the substring $$s_{il}$$, or None otherwise, where lst is a list $$\lambda_0, \lambda_1, \lambda_{l-1}$$ such that $$\lambda_i$$ is the length of the substring derived by $$w_i$$.

def make_derives(R, INPUT):
def derives(ω, i, l):
if not ω or ('ε', ) == ω: return [] if l == 0 else None
X, *χ = ω
if X in G.T:
if i <= len(INPUT) and X == INPUT[i - 1]:
s = derives(χ, i + 1, l - 1)
if s is not None: return  + s
else:
for k in range(0, l + 1):
if X in R[i, k]:
s = derives(χ, i + k, l - k)
if s is not None: return [k] + s
return None
return derives


We can for instance test that Integer Fraction Scale derives $$s_{1,4} =$$ 32.5 as

derives = make_derives(Rorig, INPUT)
derives(['Integer', 'Fraction', 'Scale'], 1, 4)

[2, 2, 0]


That tells us that Integer derives the first 2 input symbols 32, then Fraction derives the last 2 symbols .5 and finally Scale derives the empty string.

Endowed with such function, it is easy to adatp get_leftmost_prods so that it works also for the productions of the original grammar, that are not in CNF (and can hence have arbitrary length and contain non-solitary terminals).

def get_original_leftmost_prods(G, derives, N):
@show_calls(True)
def prods(X, i, l):
if X in G.T: return []
for A, α in filter(Production.such_that(lhs = X), G.P):
d = derives(α, i, l)
if d is None: continue
res = [G.P.index(Production(A, α))]
for B, l in zip(α, d):
res.extend(prods(B, i, l))
i += l
return res
return prods(G.S, 1, N)

prods_orig = get_original_leftmost_prods(Gorig, derives, len(INPUT))
prods_orig

┌prods('Number', 1, 7)
│┌prods('Real', 1, 7)
││┌prods('Integer', 1, 2)
│││┌prods('Integer', 1, 1)
││││┌prods('Digit', 1, 1)
│││││┌prods('3', 1, 1)
│││││└─ []
││││└─ 
│││└─ [2, 11]
│││┌prods('Digit', 2, 1)
││││┌prods('2', 2, 1)
││││└─ []
│││└─ 
││└─ [3, 2, 11, 10]
││┌prods('Fraction', 3, 2)
│││┌prods('.', 3, 1)
│││└─ []
│││┌prods('Integer', 4, 1)
││││┌prods('Digit', 4, 1)
│││││┌prods('5', 4, 1)
│││││└─ []
││││└─ 
│││└─ [2, 13]
││└─ [5, 2, 13]
││┌prods('Scale', 5, 3)
│││┌prods('e', 5, 1)
│││└─ []
│││┌prods('Sign', 6, 1)
││││┌prods('+', 6, 1)
││││└─ []
│││└─ 
│││┌prods('Integer', 7, 1)
││││┌prods('Digit', 7, 1)
│││││┌prods('1', 7, 1)
│││││└─ []
││││└─ 
│││└─ [2, 9]
││└─ [6, 18, 2, 9]
│└─ [4, 3, 2, 11, 10, 5, 2, 13, 6, 18, 2, 9]
└─ [1, 4, 3, 2, 11, 10, 5, 2, 13, 6, 18, 2, 9]

[1, 4, 3, 2, 11, 10, 5, 2, 13, 6, 18, 2, 9]

d = Derivation(Gorig)
for step in prods_orig: d = d.leftmost(step)
ProductionGraph(d) 