NOTE: I tried to write this post before but it came out horrible. I'm trimming it down. The original post is in the comments. What remains is mostly notes and tips on how to do development using JavaCC. Not a tutorial, just notes.
Here are the steps we took on the syntactical analysis side:
- We took some examples of the input language and started writing the grammar for it. This took a while because I really had to get myself familiar with JavaCC all over again.
- We tried to generate the parser by running Java CC and failed with a Left Recursion error.
- We looked up how to solve this and after a while figured out that we were simply missing a set of parenthesis.
- We generated the parser from the grammar and ran the parser against some complex input. It failed.
- We repeated this process for a while until we figured out that we need to make our grammar AND our input simpler to start, and work up. This was a good lesson. Don't try to get Everything right in one pass, the errors will be overwhelming. Start small, work up.
- We finally worked up slowly to the point where we could parse our original complex input.
After we got it parsing, we started adding our Objects into the grammar file so that they can be produced at parse time. The same rules applies here. Start small - at the very bottom of your parse tree, the leaves, and have those return small objects that can be passed up the tree one level at a time. Slowly, you'll be able to build nice full objects. Its probably possible to start at the top as well, but I think its more complicated. Depending on if objects have to be constructed complete, it might even be impossible.
In a dream come true, I got to write a bunch of stuff using JavaCC at work today. It was fantastic. I was able to identify the fact that we needed JavaCC for a particular problem, and sell people on it as well - also very exciting points.
ReplyDeleteWe have a user building up a complex (not too complex) query on the web browser. The data structure for the query is held in JavaScript. On submit, the browser posts the data structure to the server as a JSON object. This particular object is annoyingly complicated to parse properly, but it turns out that its easy to produce a fully parenthesized string from. Since we can easily create this String from it, we could probably easily create SQL from it. But, we already have a fully tested Object Model for building SQL, so we decided that its best to go from the JSON object, to a String, and parse the String to produce the Query objects. This might sound overly complicated, and maybe it is. I'm not sure how much logic is in the Query objects. No one ever really mentioned going directly to SQL, so this is the way were doing it.
I mentioned that we should use JavaCC to parse that String and build our Object tree, instead of writing a parser by hand. I was initially overruled, but after a few hours of going disastrously nowhere, minds changed. I suppose this is a good lesson. Don't write parsers by hand. There's just not a good reason to when you have something that can generate a robust parser simply. So we set out to use JavaCC, Leonid and I. Leonid is a Giant Russian. He also has a giant brain, and a giant heart.
I had been re-reading JavaCC code as recently as last night, so this is great. I'm going to post the jj file that we created here so that it can be used for reference throughout the rest of the post.
PARSER_BEGIN(QueryParser)
package net.transacttools.query.query.parser;
import net.transacttools.query.query.*;
public class QueryParser{
public static void main(String args[])throws ParseException{
QueryParser parser = new QueryParser(System.in);
parser.Input();
}
}
PARSER_END(QueryParser)
SKIP:{
" "
| "\t"
| "\n"
| "\r"
}
TOKEN:{
<LPAR:"(">
| <RPAR:")">
| <TAG:"TAG">
| <PROPERTY:"PROPERTY">
}
TOKEN:/* OPERATORS */
{
<ASSIGN:"=">
| <GT:">">
| <LT:"<">
| <EQ:"==">
| <NE:"<>">
| <LIKE:"%%">
| <IN:">>">
| <OR:"||">
| <AND:"&&">
}
TOKEN:/* LITERALS */
{
<STRING_LITERAL:"\""((~["\"", "\\", "\n", "\r"])
| ("\\"(["n", "t", "b", "r", "f", "\\", "'", "\""]
| ["0"-"7"](["0"-"7"])?
| ["0"-"3"]["0"-"7"]["0"-"7"])))*"\"">
}
QueryClause Input():{QueryClause qc;}{
qc=Expression()<EOF>
{
//System.out.println("Top Level QueryClause: " + qc);
}
{ return qc; }
}
QueryClause Expression():{
ConditionBase c;
QueryClause.ClauseType op = null;
QueryClause qc;
QueryClause qc2 = null;
}
{
("(" qc=Expression()
(op=Operator() qc2 = Expression())*
")"){
if( op == null ){
return qc;
}
else{
QueryClause f = new QueryClause(op);
f.put( qc );
f.put( qc2 );
return f;
}
}
| c = Condition(){
qc = QueryClause.createAndClause();
qc.put(c);
return qc;
}
{
//System.out.println("condition="+c);
//System.out.println("op = "+op);
}
}
QueryClause.ClauseType Operator():{
}
{
<AND>{ return QueryClause.ClauseType.AND; }
|
<OR>{ return QueryClause.ClauseType.OR; }
}
ConditionBase Condition():{
Operator o;
String name;
String val;
}
{
(<TAG>"["name = Anything()"]"o = AssignmentOperator()val = Anything()){
//System.out.print(o);
//System.out.print(name);
//System.out.println(val);
}
{
return new FieldCondition(name, o, val);
}
| (<PROPERTY>"["name = Anything()"]"o = AssignmentOperator()val = Anything()){
//System.out.print(o);
//System.out.print(name);
//System.out.println(val);
}
{
return new PropertyCondition(name, o, val);
}
}
String Anything():{
Token val;
}
{
val = <STRING_LITERAL>{
return val.toString();
}
}
Operator AssignmentOperator():{
Token t;
}
{
t = <EQ>{
return Operator.EQUALS;
}
| t = <NE>{
return Operator.NOT_EQUALS;
}
| t = <GT>{
return Operator.GREATER_THAN;
}
| t = <LT>{
return Operator.LESS_THAN;
}
| t = <LIKE>{
return Operator.LIKE;
}
| t = <IN>{
return Operator.IN;
}
}
So anyway, Here are the steps we took on the syntactical analysis side:
* We took some examples of the input language and started writing the grammar for it. This took a while because I really had to get myself familiar with JavaCC all over again.
* We tried to generate the parser by running Java CC and failed with a Left Recursion error.
* We looked up how to solve this and after a while figured out that we were simply missing a set of parenthesis.
* We generated the parser from the grammar and ran the parser against some complex input. It failed.
* We repeated this process for a while until we figured out that we need to make our grammar AND our input simpler to start, and work up. This was a good lesson. Don't try to get Everything right in one pass, the errors will be overwhelming. Start small, work up.
* We finally worked up slowly to the point where we could parse our original complex input.
After we got it parsing, we started adding our Objects into the grammar file so that they can be produced at parse time.
* The same rule applies here. Start small - at the very bottom of your parse tree, the leaves, and have those return small objects that can be passed up the tree one level at a time. Slowly, you'll be able to build nice full objects. Its probably possible to start at the top as well, but I think its more complicated. Depending on if objects have to be constructed complete, it might even be impossible.
Anyway, this post is getting Gigantic and I feel like having a beer. I'm going to give a very brief overview again:
* Start Small, write pieces of your grammar.
o Test
* Build up
o Test
* Slowly integrate your object model after lexical analysis
o Test
* After your object model is integrated ...
o Test
* At this point you can do whatever it is you want with your objects. In many cases people will be converting to another output format. Hopefully your object model is already tested. If not...
o Test!
That might sound like a lot of testing, and well, thats a good thing. You should many, many tests set up with many different input Strings.
Time for beer.