Contextual terminal symbols#

Now that we have attributes, a follow-up token can be recognized by the parser and yet raise an error during the evaluation of an attribute expression.

I believe we have reached a point where the classic features of parsers (synthesized and inherited attributes) are not enough for guided generation. We need not only the list of token types that can follow in a sentence, but also the values that are admissible.

To the best of my knowledge, this is a new kind of feature. Let’s call them contextual terminal symbols: terminal symbols whose admissible values depend on the execution context.

Regex attributes for terminal symbols#

So we introduce (yet) another kind of attribute: Python expressions attached to terminal symbols.

Those expressions are expected to return a regex, and they are evaluated when a lookahead on that symbol is checked. The value of the next token is then matched against that regex to decide whether the corresponding transition is admissible.
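
Stripped of the parser machinery, the core check amounts to something like this minimal sketch (transition_is_admissible, env, and the attribute string are illustrative, not the actual API):

import re

def transition_is_admissible(attr_expr: str, token_value: str, env: dict) -> bool:
    # the attribute is a Python expression that evaluates to a regex string;
    # the candidate token value is then matched against that regex
    pattern = eval(attr_expr, env)
    return re.match(pattern, token_value) is not None

# only variables declared in the current context are admissible
env = {"ctx": {"x": 1, "y": 1, "w": 0}}
attr = "'|'.join(k + '$' for k, v in ctx.items() if v > 0)"
transition_is_admissible(attr, "x", env)  # True
transition_is_admissible(attr, "z", env)  # False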

Let’s see what this means on a mock example. We define the grammar for a language of let ... in {...} statements with two rules:

  • variable shadowing in nested declarations is allowed; so let x, y in {let y, z in {...}} is correct

  • the same variable cannot be declared multiple times in the same let statement; so let x, x in {...} is invalid

from lark import Lark

nested_vars_grammar = """
%import common.NUMBER
%import common.CNAME
%import common.WS
%ignore WS

%python_header {{
from collections import defaultdict

def init_ctx(global_vars):
    # reference counts of the variables declared in the enclosing scopes
    global_vars.ctx = defaultdict(int)

def add_vars(global_vars, var_list):
    # entering a scope: increment the count of each newly declared variable
    global_vars.ctx.update([(k, global_vars.ctx[k] + 1) for k in var_list])

def remove_vars(global_vars, var_list):
    # leaving a scope: decrement the count of each variable declared in it
    global_vars.ctx.update([(k, global_vars.ctx[k] - 1) for k in var_list])

def exclude_vars(var_list):
    # regex rejecting exact matches of the already-declared names
    if len(var_list) > 0:
        return "(?!(" + '|'.join([w + '$' for w in var_list]) + "))"
    else:
        return "(.*?)"
}}

start: expr{{ init_ctx(GLOBAL) }} _expr

expr : "let" declar_var{{ [] }} "in" "{" expr{{ add_vars(GLOBAL, syn[2]) }} "}" _expr{{ remove_vars(GLOBAL, syn[2]) }}
| "some_expr" "(" args ")" _expr

_expr: 
| ";" expr

declar_var: CNAME{{ exclude_vars(inh) }} _declar_var{{ [syn[1], *inh] }}      {{ [syn[1], *syn[2]] }}

_declar_var: "," declar_var{{ inh }}      {{ syn[2] }}
|    {{ inh }}

args: atom _args

_args:
| "," args

atom: NUMBER
| CNAME{{ '|'.join([k + '$' for k,v in GLOBAL.ctx.items() if v > 0]) }}

"""
parser = Lark(nested_vars_grammar, parser="lalr")

The points of interest here are the attributes attached to the CNAME terminal symbol in the rules for atom and declar_var:

  • in atom, we are inside the core of an expression: the attribute {{ '|'.join([k + '$' for k,v in GLOBAL.ctx.items() if v > 0]) }} evaluates to a regex matching all the variable names declared in that context

  • in declar_var, we are inside a let declaration: the attribute {{ exclude_vars(inh) }} evaluates to a regex matching anything but the variables already declared in the same let (see the sketch just after this list)
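
To make the second point concrete, exclude_vars (reproduced from the %python_header above) builds a negative lookahead that rejects exact matches of the already-declared names while accepting any fresh name:

import re

def exclude_vars(var_list):
    if len(var_list) > 0:
        return "(?!(" + '|'.join([w + '$' for w in var_list]) + "))"
    else:
        return "(.*?)"

pattern = exclude_vars(["x", "y"])   # "(?!(x$|y$))"
re.match(pattern, "x")    # None: "x" is already declared in this let
re.match(pattern, "x2")   # matches: "x2" is a fresh name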

The accepts() method of the interactive parser has been modified to return these regexes along with the next admissible tokens:

def next_token(expr, parser):
    # feed the tokens of a partial expression, then ask which tokens may follow
    interactive_parser = parser.parse_interactive(start="start")
    for t in parser.lex(expr):
        interactive_parser.feed_token(t)
    return interactive_parser.accepts()

next_token("""
        let x, y in { some_expr(x,y); let w in { some_expr(
""", parser)
{Token('NUMBER', '(.*?)'), Token('CNAME', 'x$|y$|w$')}
next_token("""
        let x, y in { some_expr(x,y); let w in { some_expr(w) }; let y,z in { some_expr(
""", parser)
{Token('NUMBER', '(.*?)'), Token('CNAME', 'x$|y$|z$')}
next_token("""
        let x, y in { some_expr(x,y); let y,z in { some_expr(13) }; let z, v, w,
""", parser)
{Token('CNAME', '(?!(w$|v$|z$))')}

So the parser not only tells us which token types are accepted to continue the sentence, but also which values are valid.

An important note: these regexes don’t affect the lexer; token values are matched against the regexes after they have passed the lexer.

Implementation#

Building the parser#

Contextual terminal symbols are treated like regular terminal symbols by the builder.

This could lead to mistakes when the same token type appears as a lookahead in multiple items of an LR0ItemSet, but with possibly different contextual expressions.

Concretely, we need to detect grammars like

start: CNAME{{ 'foo' }} "foo"
| CNAME{{ 'bar' }} "bar"

The parser will raise an error (a Shift/Shift conflict of sorts) when, during the computation of an LR0ItemSet, the same symbol name with an AST expression appears as the next symbol in the progression of multiple rules.
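
A sketch of that check, assuming each LR0 item exposes the name and (optional) attribute expression of its next symbol; the names here are illustrative, not the actual internals:

from collections import defaultdict

def check_contextual_conflicts(next_symbols):
    # next_symbols: (symbol_name, attr_expr_or_None) for the next symbol
    # of each rule progression in the item set
    contextual_count = defaultdict(int)
    for name, attr_expr in next_symbols:
        if attr_expr is not None:
            contextual_count[name] += 1
    for name, count in contextual_count.items():
        if count > 1:
            # the same terminal carries an expression in several items:
            # we can't know which expression to evaluate on lookahead
            raise ValueError(f"Shift/Shift conflict on contextual terminal {name}")

# check_contextual_conflicts([("CNAME", "'foo'"), ("CNAME", "'bar'")])  # raises ValueError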

Contextual lookahead#

Internally, the core of the parser is a dictionary dict[State, dict[Token, Tuple]]:

  • in a given state, it returns a dictionary of transitions dict[Token, Tuple]

  • the transition table maps a lookahead token to the next steps of the parser: a Tuple of a new state and a Reduce or Shift action

This is where the AST expression of the contextual terminal symbol is inserted: it is added to the Tuple in the transition tables during the compilation of the parser.
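
Illustratively, a compiled transition entry might then look like this (the real structure uses the parser’s internal Shift/Reduce objects; the shapes here are only indicative):

transitions = {
    "state_7": {
        # contextual terminal: the compiled attribute expression rides
        # along with the action and the target state
        "CNAME": ("shift", "state_12", compile("exclude_vars(inh)", "<attr>", "eval")),
        # ordinary terminal: no expression to evaluate
        "NUMBER": ("shift", "state_12", None),
    },
}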

The transition dictionaries of each state are wrapped in an object that intercepts lookups. When a transition is accessed, the object evaluates the AST expression and raises a KeyError if the token value doesn’t match the regex pattern.
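
A minimal sketch of that wrapper, following the entry shape above (illustrative names, not the actual implementation):

import re
from collections import namedtuple

Token = namedtuple("Token", ["type", "value"])

class ContextualTransitions:
    def __init__(self, transitions, env):
        self._transitions = transitions  # dict[token_type, (action, state, expr)]
        self._env = env                  # namespace for evaluating the expressions

    def __getitem__(self, token):
        action, state, expr = self._transitions[token.type]
        if expr is not None:
            pattern = eval(expr, self._env)
            if re.match(pattern, token.value) is None:
                # behave as if the transition didn't exist
                raise KeyError(token)
        return action, state

trans = ContextualTransitions(
    {"CNAME": ("shift", "state_12", compile("'x$|y$'", "<attr>", "eval"))},
    env={},
)
trans[Token("CNAME", "x")]    # ("shift", "state_12")
# trans[Token("CNAME", "z")]  # raises KeyError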

The accepts() method simply returns those (non-empty) regexes along with the token types.
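
Under the same assumptions as the wrapper sketch above, accepts() could be sketched as:

def accepts(transitions, env):
    # for each admissible token type, evaluate its contextual expression
    # (if any) and pair the resulting regex with the type, giving
    # sets of the shape shown in the examples above
    result = set()
    for token_type, (_action, _state, expr) in transitions.items():
        pattern = eval(expr, env) if expr is not None else None
        result.add((token_type, pattern))
    return result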