API

Productions, Grammars and Derivations

The basic building block of a grammar is a production represented by the following class.

class Production(lhs, rhs)

A grammar production.

This class represents a grammar production, it has a left-hand and right-hand sides that can be nonempty strings, or tuples of nonempty strings; the right-hand side can contain ε only if is the only symbol it comprises. A production is iterable and unpacking can be used to obtain its sides, so for example

>>> lhs, rhs = Production('A', ('B', 'C'))
>>> lhs
'A'
>>> rhs
('B', 'C')
Parameters
  • lhs (str or tuple of str) – The left-hand side of the production.

  • rhs (str or tuple of str) – The right-hand side of the production.

Raises

ValueError – in case the left-hand or right-hand side is not a strings, or tuples of strings.

A convenience method is provided to obtain productions from a suitable string representation of them.

classmethod from_string(prods, context_free=True)

Builds a tuple of productions obtained from the given string.

Parameters
  • prods (str) – a string representing the set of productions.

  • context_free (bool) – if True all the left-hand sides will be strings (not tuples).

The string must be a sequence of lines of the form:

lhs -> alternatives

where alternatives is a list of rhs strings (possibly separated by |) and lhs and rhs are space separated strings that will be used as left-hand and right-hand sides of the returned productions; for example:

S -> ( S ) | x T y
x T y -> t
Raises

ValueError – in case the productions are declared as context_free but on of them has more than one symbol on the right-hand side.

In order to find productions satisfying a set of properties, a convenience method to build a predicate from a set of keyword arguments is provided.

classmethod such_that(**kwargs)

Returns a conjunction of predicates specified by its named arguments.

This method returns a predicate that can be conveniently used with filter() to

Parameters
  • lhs – returns a predicate that is True weather the production left-hand side is equal to the argument value.

  • rhs – returns a predicate that is True weather the production left-hand side is equal to the argument value.

  • rhs_len – returns a predicate that is True weather the length of the production left-hand side is equal to the argument value.

  • rhs_is_suffix_of – returns a predicate that is True weather the the argument value ends with the production.

Returns

A predicate (that is a one-argument function that retuns True or False) that is True weather the production given as argument satisfies all the predicates given by the named arguments.

Example

As an example, consider the following productions and the subset of them you can obtaining by filtering them according to different predicates

>>> prods = Production.from_string("A -> B C\nB -> b\nC -> D")
>>> list(filter(Production.such_that(lhs = 'B'), prods))
[B -> b]
>>> list(filter(Production.such_that(rhs = ['B', 'C']), prods))
[A -> B C]
>>> list(filter(Production.such_that(rhs_len = 1), prods))
[B -> b, C -> D]
>>> list(filter(Production.such_that(rhs_is_suffix_of = ('a', 'B', 'C')), prods))
[A -> B C]
>>> list(filter(Production.such_that(lhs = 'B', rhs_len = 1), prods))
[B -> b]

For the purpose of presenting the Knuth Automaton in the context of LR(0) parsing, productions are extended to include a dot in the following class.

class Item(lhs, rhs, pos=0)

A dotted production, also known as an item.

>>> item = Item('A', ('B', 'C'), 1)
>>> item
A -> B•C
>>> lhs, rhs, pos = item
>>> lhs
'A'
>>> rhs
('B', 'C')
>>> pos
1
Parameters
  • lhs (str or tuple of str) – The left-hand side of the production.

  • rhs (str or tuple of str) – The right-hand side of the production.

  • pos (int) – the position of the dot (optional, 0 if absent).

Raises

ValueError – in case the left-hand , or right-hand side is not a tuple of strings, or the dot pos is invalid.

A grammar can be represented by the following class, that can be instantiated given the usual formal definition of a grammar as a tuple.

class Grammar(N, T, P, S)

A grammar.

This class represents a formal grammar, that is a tuple \((N, T, P, S)\) where \(N\) is the finite set of nonterminals or variables symbols, \(T\) is the finite set of terminals, \(P\) are the grammar productions or rules and, \(S \in N\) is the start symbol.

Parameters
  • N (set) – the grammar nonterminals.

  • T (set) – the grammar terminals.

  • P (tuple) – the grammar productions.

  • S (str) – the grammar start symbol.

Even in the case of grammars, a convenience method is provided to build a grammar from a suitable string representation. Once the productions are obtained with Production.from_string(), the remaining parameters are conventionally inferred as detailed below.

classmethod from_string(prods, context_free=True)

Builds a grammar obtained from the given productions.

Parameters
  • prods (str) – a string describing the productions.

  • context_free (bool) – if True the grammar is expected to be context-free.

Once the productions are determined via a call to Production.from_string(), the remaining defining elements of the grammar are obtained as follows:

  • if the grammar is not context-free the nonterminals is the set of symbols, appearing in (the left-hand, or right-hand side of) any production, beginning with an uppercase letter, the terminals are the remaining symbols. The start symbol is the left-hand side of the first production;

  • if the grammar is context-free the nonterminals is the set of symbols appearing in a left-hand side of any production, the terminals are the remaining symbols. The start symbol is the left-hand side of the first production.

An utility method is provided to enumerate alternatives for a given left-hand side:

alternatives(N)

Yields al the right-hand sides alternatives matching the given nonterminal.

Parameters

N (str or tuple of str) – the right-hand side to match.

Yields

the right-hand sides of all productions having N as the left-hand side.

To implement grammar “cleanup strategies” a method to eliminate unwanted symbols (and productions) from a grammar is also provided.

restrict_to(symbols)

Returns a grammar using only the given symbols.

Parameters

symbols – the only allowed symbols.

Returns

a new grammar containing only the given symbols (and hence no production of the original grammar that uses a symbol not in the given set.)

Return type

Grammar

Once a grammar has been defined, one can build derivations with the help of the following class.

class Derivation(G)

A derivation.

This class is immutable, derivations can be built invoking step(), leftmost(), and rightmost()

Parameters

G (Grammar) – the grammar to which the derivations refers to.

step(prod, pos)

Performs a derivation step, returning a new derivation.

Applies the specified production to the given position in the sentential form.

Parameters
  • prod (int) – the production to apply, that is G.P[prod].

  • pos (int) – the position (in the current sentential form) where to apply the production.

Returns

A derivation obtained applying the specified production to the present production.

Return type

Derivation

Raises

ValueError – in case the production can’t be applied at the specified position.

leftmost(prod)

Performs a leftmost derivation step.

Applies the specified production to the current leftmost nonterminal in the sentential form.

Parameters

prod (int) – the production to apply, that is G.P[prod].

Returns

A derivation obtained applying the specified production to the present production.

Return type

Derivation

Raises

ValueError – in case the leftmost nonterminal isn’t the left-hand side of the given production.

rightmost(prod)

Performs a rightmost derivation step.

Applies the specified production to the current rightmost nonterminal in the sentential form.

Parameters

prod (int) – the production to apply, that is G.P[prod].

Returns

A derivation obtained applying the specified production to the present production.

Return type

Derivation

Raises

ValueError – in case the rightmost nonterminal isn’t the left-hand side of the given production.

possible_steps(prod=None, pos=None)

Yields all the possible steps that can be performed given the grammar and current sentential form.

Determines all the position of the sentential form that correspond to the left-hand side of one of the production in the grammar, returning the position and production number. If a production is specified, it yields only the pairs referring to it; similarly, if a position is specified, it yields only the pairs referring to it.

Parameters
  • prod (int) – the production whose left-hand side is to be searched (that is G.P[prod].lhs`) in the sentential form.

  • pos (int) – the position where to look for grammar productions that have a matching left-hand side.

Yields

Pairs of (pord, pos) that can be used as step() argument.

steps()

Returns the steps of the derivation.

Returns: a tuple of (prod, pos) pairs corresponding to this derivation steps.

sentential_form()

Returns the sentential form of the derivation.

Returns: a tuple of (strings representing) grammar symbols corresponding to the sentential form of this derivation steps.

Derivations can be displayed using a ProductionGraph.

Transitions and Automata

Albeit in the context of parsing (that is the main focus of this library), the role of finite state automata and regular grammars is not the main focus, a couple of classes to handle them is provided.

First of all, a basic implementation of a transition is provided; as in the case of grammars, the symbols and terminals are simply strings.

class Transition(frm, label, to)

An automaton transition.

This class represents an automaton transition; it has a frm starting state and a to destination state and a label, the states can be:

whereas the label is a str. A transition is iterable and unpacking can be used to obtain its components, so for example

>>> frm, label, to = Transition({'A', 'B'}, 'c', {'D'})
>>> sorted(frm)
['A', 'B']
>>> label
'c'
>>> to
{'D'}
Parameters
  • frm (str or set of str) – The starting state(s) of the transition.

  • label (str) – The label of the transition.

  • to (str or set of str) – The destination state(s) of the transition.

Raises

ValueError – in case the frm or to states are not nonempty strings, or nonempty set of nonempty strings, or the label is not a nonempty string.

A convenience method is provided to obtain transitions from a suitable string representation of them.

classmethod from_string(transitions)

Builds a tuple of transitions obtained from the given string.

Parameters

trasitions (str) – a string representing transitions.

The string must be a sequence of lines of the form:

frm, label, to

where the parts are strings not containing spaces.s

From transitions one can obtain a representation of a (nondeterministic) finite state automata using the following class.

class Automaton(N, T, transitions, q0, F)

An automaton.

This class represents a (nondeterministic) finite automaton.

Parameters
  • N (set) – The states of the automaton.

  • T (set) – The transition labels.

  • transitions (set of Transition) – The transition of the automata.

  • q0 – The starting state of the automaton.

  • F (set) – The set of final states.

Even for the automata there are a couple of convenience methods to obtain automata from a string representation of the transitions, or from a regular grammar.

classmethod from_string(transitions, F=None, q0=None)

Builds an automaton obtained from the given transitions.

Parameters
  • transitions (str) – a string describing the productions.

  • F (set) – The set of final states.

  • q0 (str) – The starting state of the automaton (inferred from transitions if None).

Once the transitions are determined via a call to Transition.from_string(), the starting state (if not specified) is defined as the frm state of the first transition.

classmethod from_grammar(G)

Builds the automaton corresponding to the given regular grammar.

Parameters

G (Grammar) – the regular grammar to derive the automata from.

The method works under the assumption that the only rules of the grammar are of the form \(A\to aB\), \(A\to B\), \(A\to a\), and \(A\to ε\).

An utility method is provided to represent the transition function of the automata.

δ(X, x)

The transition function.

This function returns the set of states reachable from the given state and input symbol.

Parameters
  • X – the state.

  • x – the input symbol.

Returns

The set of next states of the automaton.

Automata can be displayed using StateTransitionGraph.from_automaton.

ANTLR support

This library provides a commodity class to deal with ANTLR for the Python 3 target.

class ANTLR(grammar)

An utility class representing an ANTLR (v4) grammar and code generated from it.

Given the grammar the constructor of this class generates the code for the lexer, parser, visitor, and listener using the ANTLR tool; if no errors are found, it then loads the modules (with their original names) and stores a reference to them in the attributes named Lexer, Parser, Visitor, and Listener. If the constructor is called a second time, it correctly unloads the previously generated modules (so that the current one are correctly corresponding to the current grammar). Moreover, to facilitate debugging the grammar it keeps track both of the grammar source (in an attribute of the same name) and of the sources of the generated modules, in an attribute named source that is a dict indexed by Lexer, Parser, Visitor, and Listener.

Parameters

grammar (str) – the grammar to process (in ANTLR v4 format).

Raises

ValueError – if the grammar does not contain the name.

print_grammar(number_lines=True)

Prints the grammar (with line numbers)

Parameters

number_lines (bool) – if False line numbers will not be printed.

context(text, symbol, trace=False, diag=False, build_parse_trees=True, as_string=False)

Returns an object subclass of a antlr4.ParserRuleContext corresponding to the specified symbol (possibly as a string).

Parameters
  • text (str) – the text to be parsed.

  • symbol (str) – the symbol (rule name) the parse should start at.

  • trace (bool) – if True the method antlr4.Parser.setTrace will be used to activate tracing.

  • diag (bool) – if True the parsing will be run with a antlr4.error.DiagnosticErrorListener setting the prediction mode to antlr4.atn.PredictionMode.LL_EXACT_AMBIG_DETECTION.

  • build_parse_trees (bool) – if False the field antlr4.Parser.buildParseTrees will be set to False so that no trees will be generated.

  • as_string (bool) – if True the method will return the string representation of the context obtained using its toStringTree method.

tokens(text)

Returns a list of tokens.

This method uses the lexer wrapped in a antlr4.CommonTokenStream to obtain the list of tokens (calling the fill method).

Parameters

text (str) – the text to be processed by the lexer.

tree(text, symbol, simple=False)

Returns an annotated Tree representing the parse tree (derived from the parse context).

Unless a simple tree is required, the returned is an annotated tree whose nodes store context information from the parsing process, more precisely, nodes are dicts with the following keys:

  • type: can be rule or token,

  • name: the rule label or token name (or the rule name if it has no label: or, similarly, the token itself if it has no name),

  • value: the token value (only present for tokens named in the lexer part of the grammar),

  • rule: the rule name (only present for rules, will be different from name for labelled rules).

Note that the values are strings (so if the value is a number, it should be converted before usage).

This method removes any <EOF> tree returned in the underlying parse tree (unless a simple tree is required).

Parameters
  • text (str) – the text to be parsed.

  • symbol (str) – the symbol (rule name) the parse should start at.

  • simple (bool) – if True the returned tree nodes will be strings (with no context information).

class AnnotatedTreeWalker(key, catchall=None)

A dispatch table based way to recursively walk an annotated tree.

The aim of this class is to provide a convenient way to recursively apply functions to the subtrees of a given tree according to their root annotations.

More precisely, once an object of this class is instantiated it can be used to register a set of functions named as the values of a specified key contained in the annotated tree root; trees that don’t have a corresponding registered function are handled via a catchall function (few sensible defaults are provided).

Once such functions are registered, the object can be used as a callable and invoked on an annotated tree; this will lead to the recursive invocation of the registered (and the catchall) functions on subtrees. Such functions are invoked with the object as the first argument (and the subtree as second); in this way they can recurse on their subtrees as needed.

Parameters
  • key (str) – the key that will be looked up in the node dict to determine which registered function to call.

  • catchall (callable) – the catchall function that will be invoked in case the root is not a dict, or does not contain the key, or no function has been registered for the value corresponding to the key.

catchall(func)

A decorator used to register a catchall function.

The decorated function must have two arguments: the first will be always an instance of this object, the second the annotate tree on which to operate. The previous registered catchall function will be overwritten.

register(func)

A decorator used to register a function.

The function will be registered with his name (possibly replacing a previous registration). The decorated function must have two arguments: the first will be always an instance of this object, the second the annotate tree on which to operate.

static RECOURSE_CHILDREN(visit, tree)

A default catchall that will invoke the recursion on all the subtrees; this function does not emit any warning and does not return any value.

static TREE_CATCHALL(visit, tree)

A default catchall that will invoke the recursion on all the subtrees, emitting a warn() and returning a tree with the same root of the given tree and having as children the recursively visited children of the given tree.

static TEXT_CATCHALL(visit, tree)

A default catchall that will invoke the recursion on all the subtrees, emitting a warn() and returning a string with the same root of the given tree and having as children the (indented string representation of the) recursively visited children of the given tree.

Rich display

class Graph(arcs, sep=None)
classmethod from_adjdict(adjdict)
class Tree(root, children=None)

A n-ary tree with ordered children.

The tree stores its root and children in two fields of the same name. If the tree root is a dict the tree is an annotated tree. This kind of trees arise from parsing, for example using the tree() method of the ANTLR class.

A tree is represented as a string as a (<root>: <children>) where <root> is the string representation of the root node content and <children> is the recursively obtained string representation of its children.

It also as a representation as a graph; in such case, if the tree is annotated it will be rendered by Graphviz using HTML-Like Labels built as a table where each dictionary item corresponds to a row with two columns containing the key and value pair of such item.

Parameters
  • root – the root node content (can be of any type).

  • children – an iterable of trees to become the current tree children.

classmethod from_lol(lol)

Builds a tree from a list of lists.

A list of lists (lol) is a list recursively defined such that the first element of a lol is the tree node content and the following elements are lols.

Parameters

lol (list) – a list representing a tree.

Examples

The following code

Tree.from_lol([{'this': 'is', 'an': 2},[3, ['example']], [5]])

produces the following tree

_images/tree.svg
with_threads(threads)

Draws a threaded and annotated tree (as a graph).

Tree threads are arcs among tree nodes (or special nodes), more precisely, a dict mapping annotated tree source nodes to a dict whose values are destinations tree nodes; arcs are represented as dotted arrows. Special nodes are: the start tree (an annotated tree whose root contains the (type, <START>) item), the join trees (annotated trees whose root contains the (type, <JOIN>) item) and the None value; such special nodes are represented as red dots.

Args: threads (dict): a dictionary representing the threads.

class ProductionGraph(derivation, compact=None)
class StateTransitionGraph(transitions, S=None, F=None, large_labels=False)

A directed graph with labelled nodes and arcs.

Parameters
  • transitions – an iterable of triples \((f, l, t)\) of three strings representing the \((f, t)\) edge of the graph with label \(l\).

  • S (str) – the starting node.

  • F (set) – the set of final nodes.

  • large_labels (bool) – whether the string labelling nodes are long and require pre-formatting before being drawn or not.

Examples

The following code

StateTransitionGraph([
      ('node a','arc (a,b)','node b'),
      ('node a', 'arc (a,c)', 'node c'),
      ('node b', 'arc (b, c)', 'node c')],
  'node a', {'node b'})

gives the following graph

_images/stg.svg

where nodes and edges are labeled as implied by the first parameter, the staring node has an (unlabelled) edge entering into it, whereas the final nodes are doubly circled.

classmethod from_automaton(A, coalesce_sets=True, large_labels=False)

A factory method to build a StateTransitionGraph starting from an Automaton.

Parameters
  • A (Automaton) – the automaton.

  • coalesce_sets (bool) – whether the automata states are sets and the corresponding labels must be obtained joining the strings in the sets.

Utilities and decorators

class Queue(iterable=None, maxlen=None)

A convenience implementation of a queue providing the usual enqueue, dequeue, and copy methods.

class Stack(iterable=None, maxlen=None)

A convenience implementation of a stack providing the usual push, pop, peek, and copy methods.

warn(msg)

Emits a string on the standard error.

peek(s)

Returns the fist element of the given set.

union_of(s)

Return the set union of its arguments.

closure(f)

Wraps a function in a closure computation.

This decorator takes a function f(S, O) and returns a function F(S, O) that repeatedly calls f on its own returned value and O until it doesn’t change.

Parameters

f – the function to wrap in the closure computation.

Example

Consider the following example where the function f given a set S of integers and an integer m as input, returns the set obtained adding to S all the values s - 1 for all its elements greater than some lower limit m.

>>> @closure
... def reduce_up_to(S, m):
...     return S | {s - 1 for s in S if s > m}
...
>>> reduce_up_to({7,5}, 3)
{3, 4, 5, 6, 7}

It is evident that its closure will return the set of values from m to the largest element in S.

Notes

More formally, consider a function \(f : \mathfrak{D}\times \mathfrak{X} \to \mathfrak{D}\) where \(\mathfrak{D}\) is a domain of interest and \(\mathfrak{X}\) is the (possibly empty) domain of extra parameters. Define as usual the iterated application of \(f\) as \(f^{(n + 1)}(D, X) = f(f^{(n)}(D, X), X)\), where \(f^{(0)}(D, X) = f(D, X)\). Then \(F = \textit{closure}(f)\) is the function \(F: \mathfrak{D}\times \mathfrak{X} \to \mathfrak{D}\) defined as \(F(D, X) = f^{(\hat{n})}(D, X)\) where \(\hat{n}\) is the minimum \(n\) (depending on \(D\) and \(X\)) such that \(f^{(\hat{n} + 1)}(D, X) = f^{(\hat{n})}(D, X)\).

show_calls(show_retval=False)

Wraps a function so that it calls (and return values) are printed when invoked.

This decorator takes a function a decorated function that prints every invocation (with the value of the parameters and optionally the returned value).

Parameters
  • f – the function to wrap in the closure computation.

  • show_retval – wether to show also the return values.

Example

This is how to use the decorator with the classic recursive implementation of the Fibonacci numbers computation.

>>> @show_calls(True)
... def fib(n):
...     if n == 0 or n == 1: return 1
...     return fib(n - 1) + fib(n - 2)
...
>>> _ = fib(3)
┌fib(3)
│┌fib(2)
││┌fib(1)
││└─ 1
││┌fib(0)
││└─ 1
│└─ 2
│┌fib(1)
│└─ 1
└─ 3