Newsletter:

(Tutorial) JJTree Tutorial for Advanced Java Parsing

The Problem
JJTree is part of JavaCC, a parser/scanner generator for Java. JJTree is a preprocessor for JavaCC that inserts parse tree building actions at various places in the JavaCC source. To follow along you need to understand the core concepts of parsing. Also review the basic JJTree documentation and the samples provided in the JavaCC distribution (version 4.0).

JJTree is magically powerful, but it is just as complex. We used it quite successfully at my startup www.moola.com. After some basic research into grammar rules, lookaheads and node annotations, plus some prototyping, I felt quite comfortable with the tool. However, just recently, when I had to use JJTree again, I hit the same steep learning curve as if I had never seen JJTree before.

How do you write a tutorial that gets you back in shape quickly without forcing a full relearning?

The Solution
Here I capture my notes in a specific form so that I do not have to face that same learning curve again in the future. You can think of my approach as a layered improvement to a grammar that follows these steps:

  • get lexer

  • complete grammar

  • optimize produced AST

  • define custom node

  • define actions

  • write evaluator

I always start simple and then need to get more complex, and this is exactly how I will document it. In each example I start with a trivial portion of the grammar and then add some more to it to force specific behavior. New code is always in green. Let's hope this saves all of us the relearning.

Reorder tokens from more specific to less specific
The tokens in a TOKEN section can be declared in any order, but you have to pay very close attention to that order. The generated token manager picks the token that matches the longest piece of input, and when several tokens match the same longest piece, the one declared first wins. For example, notice how "interface" and "exception" are defined before the more general STRING_LITERAL and TERM. If we had defined "interface" after TERM, the input interface would never be matched as INTERFACE; it would be matched as TERM.

TOKEN : {
// keywords come first so that they win a tie against TERM
< INTERFACE: "interface" >
| < EXCEPTION: "exception" >
| < ENUM: "enum" >
| < STRUCT: "struct" >

// more general tokens follow the keywords
| < STRING_LITERAL: "'" (~["'","\n","\r"])* "'" >
| < TERM: <LETTER> (<LETTER>|<DIGIT>)* >

| < NUMBER: <INTEGER> | <FLOAT> >
| < INTEGER: ["0"-"9"] (["0"-"9"])* >
| < FLOAT: (["0"-"9"])+ "." (["0"-"9"])* >
| < DIGIT: ["0"-"9"] >
| < LETTER: ["_","a"-"z","A"-"Z"] >
}
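
To make the ordering rule concrete, here is a rough sketch in plain Java that imitates it with java.util.regex. It is not JavaCC output and not how the generated token manager is implemented. The class TokenOrderDemo, its methods, and the regular expressions (my approximations of three of the TOKEN definitions) are hypothetical and exist only for illustration. The rule it imitates is: the longest match wins, and on a tie the token declared first wins.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: imitates the tie-breaking rule, it is not JavaCC code.
class TokenOrderDemo {

    // Returns the name of the token chosen for the start of the input.
    // The longest match wins; on a tie the token declared first wins,
    // because a later token only replaces the best one if it is strictly longer.
    static String firstToken(Map<String, Pattern> declared, String input) {
        String bestName = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> entry : declared.entrySet()) {
            Matcher m = entry.getValue().matcher(input);
            // lookingAt() anchors the match at the start of the input.
            if (m.lookingAt() && m.end() > bestLength) {
                bestName = entry.getKey();
                bestLength = m.end();
            }
        }
        return bestName;
    }

    // Builds the token list in declaration order (LinkedHashMap keeps that order).
    // keywordFirst decides whether INTERFACE is declared before or after TERM.
    static Map<String, Pattern> tokens(boolean keywordFirst) {
        Map<String, Pattern> t = new LinkedHashMap<>();
        if (keywordFirst) {
            t.put("INTERFACE", Pattern.compile("interface"));
        }
        t.put("STRING_LITERAL", Pattern.compile("'[^'\\n\\r]*'"));
        t.put("TERM", Pattern.compile("[_a-zA-Z][_a-zA-Z0-9]*"));
        if (!keywordFirst) {
            t.put("INTERFACE", Pattern.compile("interface"));
        }
        return t;
    }

    public static void main(String[] args) {
        // Keyword declared first: the tie with TERM goes to INTERFACE.
        System.out.println(firstToken(tokens(true), "interface Foo"));  // INTERFACE
        // Keyword declared last: TERM is declared earlier and wins the tie.
        System.out.println(firstToken(tokens(false), "interface Foo")); // TERM
        // A longer match beats declaration order.
        System.out.println(firstToken(tokens(true), "interfaces"));     // TERM
    }
}
```

The third call shows why ordering only matters for ties: "interfaces" is one character longer as a TERM than as the keyword, so the keyword loses even though it is declared first.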

The ordering is also the reason why we can't just use "interface" inline in the definition of productions. The more general TERM, declared earlier in the TOKEN section, will always match first....

Courtesy:- http://www.softwaresecretweapons.com/