Monday, November 11, 2019

Antlr 4 grammar with spaces

Intro

    This post discusses the difficulties and solutions of creating an Antlr grammar without removing spaces from the text being parsed. But first, a brief explanation of what Antlr really is.
    Here is the link https://github.com/antlr/antlr4. Antlr can take a text with a well-defined rules and create a syntax tree out of it. It is basically what it is so good for. The user must create the grammar of the DSL (domain-specific language) and feed it to the tool. The tool generates the parser code. It can be a Java code or Javascript code. Antlr has no less than 5 targets (supported languages). This generated parser code can parse the text and create an abstract syntax tree. Then the user can use the generated visitor (it is the well-known visitor pattern) to walk the AST to get something out of it. 

Problem

    Most Antlr grammars remove the whitespace in the lexer like this:
WS
: [ \r\n\t] + -> skip
;
This definitely makes parsing somewhat easier as it transforms this expression:
    q1 + q2 / f2 - ( 4 + 5 )
into a much simpler form:
    q1+q2/f2-(4+5)
Without these spaces it is much easier to write the rules for the Antlr grammar.
   
    But in some cases, it is not possible to remove all spaces because they may be part of a variable name. In the case, I was working on variables were with single quotes like this: 'Variable 1'. Spaces were part of the variable name. The grammar had to take into account that a space could be part of the variable name but also could be a "useless" character that the user put in for readability like a space after a comma. Special handling of spaces is required to parse this expression:

'Variable ABC' * 5 + 'Variable Q'*( 1 + 'Variable W')

Solution

The idea is as follows: if we cannot remove spaces from the text we have to "pull" spaces into parenthesis, operation signs, etc. 
The rules for the operators and parenthesis must change to pull in spaces. Firstly, we need to declare the tokens for spaces:
ANY_SPACE:                    SINGLE_SPACE+;
SINGLE_SPACE:                 ' ';

We cannot use SINGLE_SPACE* because the Antlr tool correctly shows that SINGLE_SPACE* can match an empty string. 
Then incorporate these tokens into rules for operators (MUL is '*' and Div is '/', rParameterSeparator is for the comma that separates function actual parameters):

rLeftVarParenthesis:          L_VAR_PAREN ANY_SPACE | L_VAR_PAREN;

rArithmeticOperatorMulDiv:    ANY_SPACE (MUL | DIV) ANY_SPACE | ANY_SPACE (MUL | DIV) | (MUL | DIV) ANY_SPACE | (MUL | DIV);

rParameterSeparator:          ANY_SPACE ',' ANY_SPACE | ANY_SPACE ',' | ',' ANY_SPACE | ',';

Now, this grammar can handle variables with spaces in their names. The visitor when visiting the operator rule can join the text from the children and use trim to remove all unnecessary space. In Javascript it may look like this:
return ctx.children.join('').trim();

This simple and elegant solution allows for parsing variable names with spaces.