ATS supports user-defined symbols with custom fixity and precedence. The patsopt compiler runs a dedicated pass to resolve fixity before all other passes.

I tried to build a parser that supports the same thing using Antlr. After several attempts I got it working, and I would like to share it here.

Precedence Climbing Parsing

Antlr uses this technique internally [antlrbook]. The details are in the blog post [Norvell], whose author extends the original algorithm to support postfix and non-assoc operators and also gives it a name, "precedence climbing". The post is a must-read.

The idea is simple and elegant. A traditional expression grammar creates a new nonterminal for each level of precedence, as follows:

atom: NUMBER | '(' expr ')' | '-' atom;
mult: atom (('*' | '/') atom)*;
expr: mult (('+' | '-') mult)*;

But this hard-codes precedence and fixity into the grammar. With precedence climbing, expr is instead parameterized by a precedence as expr[p], which matches an expression containing no operators whose precedence is below p.

Therefore

binaryOp: ...
unaryOp: ...
atom
    : NUMBER 
    | '(' expr[0] ')' 
    | unaryOp expr[q]
    ;
    
expr[int p]
    : atom (binaryOp expr[q])*
    ;

The value q depends on the previous operator's precedence prio(op) and fixity fix(op).

  • if fix(op) = infix left-assoc then q = prio(op)+1
  • if fix(op) = infix right-assoc then q = prio(op)
  • if fix(op) = prefix (assuming unary) then q = prio(op)
  • if fix(op) = postfix (assuming unary) then q = N/A
  • if fix(op) = infix non-assoc then q = prio(op) + 1

(The last two are not covered in the above example yet.)
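The table of q values can be written down as a small Java function. This is only a sketch: the class name NextPrecedence and the two-argument signature are made up for illustration, and the fixity strings (infixl, infixr, infix, prefix, postfix) are borrowed from the declarations used in the Result section.

```java
final class NextPrecedence {
    // Hypothetical helper: returns q, the precedence passed to the recursive
    // expr[q] call for the operand that follows an operator.
    static int nextp(String fixity, int prio) {
        switch (fixity) {
            case "infixl":  // left-assoc: the right operand must bind tighter
            case "infix":   // non-assoc: same value, chaining is rejected elsewhere
                return prio + 1;
            case "infixr":  // right-assoc: the same operator may recur on the right
            case "prefix":  // prefix: the operand may contain operators of this level
                return prio;
            default:        // postfix has no right operand, so q is not applicable
                throw new IllegalArgumentException("no right operand for fixity: " + fixity);
        }
    }

    public static void main(String[] args) {
        System.out.println(nextp("infixl", 3));  // 4
        System.out.println(nextp("infixr", 6));  // 6
        System.out.println(nextp("prefix", 6));  // 6
        System.out.println(nextp("infix", 10));  // 11
    }
}
```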

In this case, parsing 1+2*3-4 results in the following trace, where a, b and c stand for the values of q computed after '+', '*' and '-':

expr[0]: 1 + expr[a]
    expr[a]: 2 * expr[b]
        expr[b]: 3, return 3 because '-' is of lower precedence than 'b'
    return (2*3) because '-' is of lower precedence than 'a'
expr[0]: 1 + (2*3) - expr[c]
    expr[c]: 4, return 4 because of EOF token
return 1 + (2*3) - 4

In other words, a precedence test failure makes the current call return the part it has matched so far (as in expr[a] and expr[b]), and the caller then continues with the next iteration of its loop (as in expr[0]).
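To make the trace concrete, here is a minimal hand-written sketch of the expr[p] loop in Java. It is not the Antlr grammar: the class ClimbDemo is hypothetical, it assumes only the four left-assoc operators + - * / with hard-coded precedences 1 and 2, and it returns a fully parenthesized string instead of a tree.

```java
final class ClimbDemo {
    private final String src;
    private int pos = 0;

    ClimbDemo(String src) { this.src = src.replace(" ", ""); }

    // Assumed precedences for this demo only; -1 means "not a binary operator".
    static int prio(char op) {
        switch (op) {
            case '+': case '-': return 1;
            case '*': case '/': return 2;
            default: return -1;
        }
    }

    // expr[p]: an atom followed by operators whose precedence is at least p.
    String expr(int p) {
        String left = atom();
        while (pos < src.length()) {
            char op = src.charAt(pos);
            int prio = prio(op);
            if (prio < p) break;             // precedence test failed: return what we have
            pos++;                           // consume the operator
            String right = expr(prio + 1);   // all four operators are left-assoc
            left = "(" + left + op + right + ")";
        }
        return left;
    }

    // atom: NUMBER | '(' expr[0] ')'
    String atom() {
        if (pos < src.length() && src.charAt(pos) == '(') {
            pos++;
            String inner = expr(0);
            if (pos >= src.length() || src.charAt(pos) != ')')
                throw new IllegalStateException("')' expected at " + pos);
            pos++;
            return inner;
        }
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
        if (start == pos) throw new IllegalStateException("number expected at " + pos);
        return src.substring(start, pos);
    }

    static String parse(String s) {
        ClimbDemo d = new ClimbDemo(s);
        String tree = d.expr(0);
        if (d.pos != d.src.length()) throw new IllegalStateException("unexpected input at " + d.pos);
        return tree;
    }

    public static void main(String[] args) {
        System.out.println(parse("1+2*3-4"));  // ((1+(2*3))-4)
    }
}
```

The recursive calls follow the trace above: expr(2) stops at '-' and hands (2*3) back to expr(0), which then loops once more for "- 4".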

That's it: a really elegant way of parsing expressions.

Semantic Predicates in Antlr

For this part, I recommend this blog, Stuff, and this stackoverflow Q and A. Although they are about Antlr 3, the rationale behind them still holds.

From my understanding, a predicate is a tool for stopping the current matching attempt, but depending on where you put it, you get different recovery behaviour. Also, a predicate can only reference tokens that come before it (using labels); to see what comes after it, you have to call lookahead methods.

From my experience, don't put predicates in the middle of a rule (or please tell me what the semantics of such predicates are, I'm not clear on that). Put them either at the beginning of an alternative (including an embedded sub-alternative or a loop), or at the end of a loop.

  • When you put it at the beginning of an alternative, the predicate helps the parser choose among alternatives.
  • When you put it at the beginning of a loop, it helps decide whether to enter the loop or stop looping.
  • When you put it at the end of a loop, it decides when to stop the loop and return whatever has already been parsed.

Allowing User-defined Operators

The code speaks for itself.

The first predicate

{(fmap.get(_input.LT(1).getText()).contains("infix")) && pmap.get(_input.LT(1).getText()) >= $p}?

ensures that only infix operators enter this rule (contains("infix") matches infixl, infixr and the non-assoc infix alike). Postfix operators will enter the next rule, which is guarded by this predicate

{(fmap.get(_input.LT(1).getText()).equals("postfix")) && pmap.get(_input.LT(1).getText()) >= $p}?

As defined, expr[p] only matches an expression in which all operators have precedence equal to or higher than p, hence the >= $p test.

Note that since our predicates are located before the operator token, we need to look ahead using _input.LT(1).

{!fmap.get($op.text).equals("infix")}? 

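Here the operator has already been matched, so the predicate can use the label $op instead of lookahead. It ensures that a non-assoc operator will not chain into something like

a = b = c // syntax error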

For the aexpr rule,

{fmap.get(_input.LT(1).getText()).equals("prefix")}?

ensures that only prefix operators enter this rule.

Finally, expr[nextp($op.text)] recursively enters the expr[] rule with the next precedence, computed according to our definition above.
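For readers without Antlr at hand, the same rules can be approximated by a hand-written Java parser. This is a sketch under assumptions, not the grammar from this post: the class UserOpParser and its declare and parse methods are hypothetical, tokens must be separated by whitespace, and the result is a parenthesized string rather than a tree. The names fmap, pmap and nextp mirror the ones used in the predicates, and leaving the loop after a non-assoc operator is one reading of the last predicate.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class UserOpParser {
    final Map<String, String> fmap = new HashMap<>();   // operator -> fixity
    final Map<String, Integer> pmap = new HashMap<>();  // operator -> precedence
    private final Deque<String> tokens = new ArrayDeque<>();

    void declare(String fixity, int prio, String... ops) {
        for (String op : ops) { fmap.put(op, fixity); pmap.put(op, prio); }
    }

    // q for the operand following op; only called for infix and prefix operators.
    int nextp(String op) {
        String fix = fmap.get(op);
        int prio = pmap.get(op);
        return (fix.equals("infixl") || fix.equals("infix")) ? prio + 1 : prio;
    }

    String expr(int p) {
        String left = aexpr();
        while (!tokens.isEmpty()) {
            String op = tokens.peek();                   // like _input.LT(1)
            String fix = fmap.get(op);
            if (fix == null || pmap.get(op) < p) break;  // the ">= $p" test
            if (fix.contains("infix")) {
                tokens.pop();
                left = "(" + left + " " + op + " " + expr(nextp(op)) + ")";
                if (fix.equals("infix")) break;          // non-assoc: do not chain
            } else if (fix.equals("postfix")) {
                tokens.pop();
                left = "(" + left + " " + op + ")";
            } else {
                break;                                   // prefix operator: not ours
            }
        }
        return left;
    }

    String aexpr() {
        if (tokens.isEmpty()) throw new IllegalStateException("unexpected end of input");
        String t = tokens.pop();
        if (t.equals("(")) {
            String inner = expr(0);
            if (tokens.isEmpty() || !tokens.pop().equals(")"))
                throw new IllegalStateException("')' expected");
            return inner;
        }
        if ("prefix".equals(fmap.get(t))) return "(" + t + " " + expr(nextp(t)) + ")";
        if (t.equals(")") || fmap.containsKey(t))
            throw new IllegalStateException("operand expected but found " + t);
        return t;                                        // anything else is an atom
    }

    String parse(String src) {
        tokens.clear();
        tokens.addAll(Arrays.asList(src.trim().split("\\s+")));
        String tree = expr(0);
        if (!tokens.isEmpty()) throw new IllegalStateException("syntax error at " + tokens.peek());
        return tree;
    }

    public static void main(String[] args) {
        UserOpParser ps = new UserOpParser();
        ps.declare("prefix", 6, "-");
        ps.declare("infixr", 6, "^");
        ps.declare("infixl", 3, "+");
        ps.declare("infixl", 5, "*", "/");
        ps.declare("infix", 10, "=");
        ps.declare("postfix", 7, "!");
        System.out.println(ps.parse("1 ! ! + - 2 * 3 ^ 5 ^ 6"));
    }
}
```

With these declarations, a = b = c is rejected because the loop stops after a = b and the second = is left unconsumed. The sketch is a simplification, though: when the stopped loop belongs to a nested expr call, an enclosing call may still pick up the second operator, so it does not reject every possible chain.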

Result

The parser will generate the following tree for the code below

prefix 6 -;
infixr 6 ^;
infixl 3 + ;
infixl 5 * /;
infix 10 =;
postfix 7 !;
 
1 ! ! + - 2 * 3 ^ 5 ^ 6 + - - 4 ! = 5 = 6;

Please note that I assigned precedence 10 to = because I wanted to show that it is non-assoc.
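How the fixity declarations end up in fmap and pmap is not shown in this post. One possible way, sketched in Java under the assumption that each declaration has the form "fixity precedence op...;" exactly as in the sample above (the class FixityTable and its load method are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

final class FixityTable {
    final Map<String, String> fmap = new HashMap<>();   // operator -> fixity keyword
    final Map<String, Integer> pmap = new HashMap<>();  // operator -> precedence

    // Reads declarations such as "infixl 5 * /;" and fills both maps.
    void load(String declarations) {
        for (String decl : declarations.split(";")) {
            String trimmed = decl.trim();
            if (trimmed.isEmpty()) continue;             // blank text between declarations
            String[] parts = trimmed.split("\\s+");
            if (parts.length < 3)
                throw new IllegalArgumentException("incomplete declaration: " + trimmed);
            int prio = Integer.parseInt(parts[1]);
            for (int i = 2; i < parts.length; i++) {     // one declaration may list several operators
                fmap.put(parts[i], parts[0]);
                pmap.put(parts[i], prio);
            }
        }
    }

    public static void main(String[] args) {
        FixityTable table = new FixityTable();
        table.load("prefix 6 -;\ninfixr 6 ^;\ninfixl 3 + ;\ninfixl 5 * /;\ninfix 10 =;\npostfix 7 !;");
        System.out.println(table.fmap.get("/") + " " + table.pmap.get("/"));  // infixl 5
    }
}
```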

Comments are welcome.