Grit Grammar Parser Reference

This reference guide is for the the NPM grit-parser package.

Project repository: https://github.com/pcanz/grammar-parser

Usage

Node.js

    npm install grit-parser
    const grit = require("grit-parser");

Or use a copy of the: grit-parser.js file from: https://github.com/pcanz/grammar-parser

    const grit = require("./grit-parser.js");

Browser:

Load a copy of the: grit-parser.js file from: https://github.com/pcanz/grammar-parser

After it is loaded the bowser window global: grit is assigned the constructor function.

Using grit:

The grit constructor function is best used as a JavaScript template literal tag:

    const my_grammar = grit` .... grammar rules .... `;

But it can also be used as a regular function:

    const my_grammar = grit(" ... grammar rules ... ");

Or better (to avoid the need to escape back-slash characters):

    const my_grammar = grit( String.raw` ... grammar rules ... ` );

Action functions

Can be defined as:

    my_grammar.actions = {
        rule_action: ... JS function ... ,
        another: ...
    }

Or as a second argument to the grit function:

    const my_grammar = grit(" ... grammar rules ... ", { ... action functions ... });

Parse Functions:

    const my_grammar = grit` .... grammar rules .... `;
    var parse_tree = my_grammar.parse(" ... input ...");

The parse method will throw an exception if the parse fails for any reason.

The match method is the same as the parse method, but it will return a null result if the grammar does not match the input.

    var parse_tree = my_grammar.match(" ... input ...");

PEG Rules

A grammar rule names a PEG expression that contains string pattern matching components.

A PEG expression e may be either:

If e1 and e2 are PEG expressions, then so too is:

The choice e1 / e2 will only try e2 if e1 fails (first match wins), and there is no backtracking.

An empty match is not a failure, so: e1* / e2 should never try to match e2. But in practice it is very easy to make this kind of mistake, and there seems to be no practical reason not to treat a match of nothing the same as a failure. So this is treated as a special case, and the parser will try to match e2 if e1 has not matched any input.

Any PEG expression e can be given a predicate prefix:

Any PEG expression e can be given a repeat suffix:

The result of a repeat is always the longest match and no other.

The binding strength or precedence is (low to high):

So, for example:

String Matching Components

A string matching component in a grammar rule can be either:

Note that quoted literal "x" and the regex [x] will both match a literal character x, but "x" will skip any surrounding white-space, so "x" is equivalent to the regex: \s*(x)\s*

'a+b+c' => regex: [a]\+b\+c
"a+b+c" => regex: \s*(a\+b\+c)\s*

A ^ in a regex will only match at the beginning, but since each regex component is treated as a separate match it is not necessary to use a ^ at the start of a regex component (but it does no harm). Since any regex may begin with a ^ is can be used to introduce any regex component, for example, ^(x)+ is a regex to match one or more x characters, but (x)+ is not a regex, it will match a list of one or more x rules.

Ideally the regex components should be kept simple, such as a char-set eg: [abc], or a repeated char-class eg: \d+. There is no loss of expressive power in keeping the regex matches simple since they are components in a larger PEG grammar. The PEG logic can include lookahead tests eg: &x for positive lookahead tests, or !x for negative lookahead tests, or a semantic action can be used.

Simple regex components also eliminate troublesome issues with some regular expression implementations, see Russ Cox for details.

Semantic Actions

Semantic actions are functions that are applied to the result of a rule match. The actions are defined as properties of an object, which can be assigned to the actions property of a parser:


The function name corresponding to the rule name will be called with the result matched by that rule.

The grit grammar parser constructor function may also be called with two arguments: the grammar rules, and the action functions, as in example 1.2.

The name of a semantic action function may be appended to a rule with a : separator:


If the semantic action returns null then the rule will fail, otherwise the action may return any result as the rule result.

If the rule has a semantic action function name that does not match a custom function then it may match a standard built-in function name.

An application program can always process the default parse tree without using any semantic actions(in that case the parse tree result will be an array of arrays of string values). But it is often more convenient to break the parse tree processing into smaller functions applied to the result of an individual rule.

For example, a calculator application could process the parse tree for arithmetic expressions, but it is simpler to break the calculator application into semantic action functions that incrementally process the results of individual grammar rules:


The action functions are called with two arguments:

The second argument parse object is not often needed, but it gives the action function access to information that can be useful for debugging, or for experimental actions (extending the built-in functions), or even to take over for special case parsing if necessary.

It is good practice to first develop a grammar parser without using any semantic action functions. This first priority is to ensure that the grammar rules recognize input strings correctly, regardless of the structure of the parse tree. After that the rules may be reorganized to simplify the parse tree, and finally semantic actions may be employed to process the parse tree as needed by the application.

It is important to remember that a semantic action may be called before the rule later fails as a component in another rule. This is too late for any side effects the semantic action may have generated.

Built-in Functions

The built-in functions:

Rules of this form: x (op x)* are an idiomatic way to match lists with separators, or arithmetic expressions, and many other formats. These rules generate an awkward parse tree structure which can be simplified by these standard functions:

The yfx and xfy actions generate parse tree nodes in this form:

    [op, left_tree, right_tree]

Here they are used to generate the desired parse tree for our arithmetic expression grammar:


The tree structures:

              +                 ^          
             / \               / \         
  yfx:      +   4      xfy:   1   ^        yfy:  1 + 2 + 3 + 4    xfx:  1 2 3 4 
           / \                   / \
          +   3                 2   ^    
         / \                       / \
        1   2                     3   4     

The semantic actions can be thought of as a sort of type specification for the rule result, which can be appended to the end of the rule.

Debug Trace

The parse function has a second argument for options, which can generate a trace log to help debug a grammar.

Here is an example of using a trace:

    const arith = grit`
        expr   = factor ([+-] factor)*
        factor = term ([*/] term)*
        term   = \d+ / "(" expr ")"
    `;

    var e = expr.parse("1+2*(3+4)-5", {trace: true} );

    0..1 term /\d+/y =>1
    1 factor /[*\/]/y !
    1..2 expr /[+-]/y =>+
    2..3 term /\d+/y =>2
    3..4 factor /[*\/]/y =>*
    4 term /\d+/y !
    4..5 term /\s*(\()/y =>(
    5..6 term /\d+/y =>3
    6 factor /[*\/]/y !
    6..7 expr /[+-]/y =>+
    7..8 term /\d+/y =>4
    8 factor /[*\/]/y !
    8 expr /[+-]/y !
    8..9 term /\s*(\))/y =>)
    9 factor /[*\/]/y !
    9..10 expr /[+-]/y =>-
    10..11 term /\d+/y =>5
    11 factor /[*\/]/y !
    11 expr /[+-]/y !

The numbers on the left are the input position of the match components with the name of the rule the component is in. Only the input match operations are logged in the trace, they usually give the most useful information, and a full trace of all rules can be very verbose.

Grammar Grammar

The grammar rules can define themselves. Here is a minimal grammar that is sufficient to define itself and parse itself:


This grammar only accepts simple regex char-set string match components.

Here is the full grit grammar grammar:

    grammar = rule+
    rule    = name "=" expr ws act?

    expr    = seq ("/" seq)*
    seq     = (ws [&!]? term [*+?]?)*
    term    = ref / quote / regex / group

    name    = ws \w+
    ref     = name !\s*=
    group   = "(" expr ")"

    quote   = '"' [^"]* '"' / "'" [^']* "'"
    regex   = &[[\\^] (chs / par / misc)+
    chs     = [\[] ([^\]\\]* ([\\][^])?)+ [\]]
    par     = [(] ([^()]* par?)* [)]
    misc    = [^[()\s]+

    act     = ":" lines
    lines   = line (\s* !\S+\s*= line)*
    line    = [^\n\r]* 
    ws      = \s*