Jack Cough on Software: Refactoring in Scala

Another really long post, but why not!

My Lexer is now approaching 500 lines with roughly 350 lines of test code. Maybe that's still trivial, but it doesn't feel trivial. Maybe it doesn't feel trivial because it does a lot. Those 500 lines of code cover so many features. Yesterday night I started to add a bunch more features, and realized a few spots needed some heavy refactoring to achieve what I want. Here's what I wanted to add:

New line support. Yes, my previous lexer could only lex one line at a time. Syntax errors indicated the offset, but not the line. A parser should know what line and file the token occurred in.

Better handling and testing of syntax errors.

Separate Scala lexing from Lexer core, providing a reusable Lexer framework for Lexing any language.

I had a few other goals as well. I wanted to have the tests be much more organized, mostly arranging for testing individual Finders. So far, everything was lumped together in one giant test. I wanted to do a general cleanup of my Lexer trait, and extract the mutability from it. And as usual, I wanted to make sure things were nice and clean and short and readable. I have a few goals on my plate that I didn't get to as well. Comments, Strings, Chars, peeking at the next token without mutating. Some of that stuff might require the Finders to be away of a Context of some sort. I hope not. Also, maybe I could write a preprocessor to replace any comments with white space.

Multi Line Support

Anyway, lets get to the new line support. Like I said, the main goal was to have better syntax error support, knowing the line and the offset, not just the offset. But the parser needs to more information that just the offset, as well. In order to do this, I decided to introduce the idea of a CompilationUnit, which holds all the code. Previously, the Lexer just worked with a single array.

Old code:


trait Lexer extends SyntaxErrorHander {
  var chars: FunctionalCharArray
  var currentIndex = 0

New code:


trait Lexer extends CompilationUnitWalker with SyntaxErrorHander {
  var unit: CompilationUnit

Notice something else important here. The old code held its chars and was responsible for handing its own mutation, and the mutation was a bit cluttered in with the lexing logic. It wasn't horrible, but not great. Now the mutation is all self contained in CompilationUnitWalker, who manages the current line and current character pointers. I'll save the listing for that trait until the end, as its where all the ugliness lives. But that's the good news, its no longer in Lexer. For now though, it helps to see the interface for that trait:


trait CompilationUnitWalkerInterface {
  var unit: CompilationUnit
  def skipChar: Unit
  def skipChars(i: Int): Unit
  def skipLine: Unit
  def currentLine: FunctionalCharArray
  def currentLineNumber: Int
  def currentOffset: Int
  def eofReached_? : Boolean
}

The finders need the current line, Tokens and errors need currentLineNumber and currentOffset, and the Lexer itself needs to know if it processed the whole file (eofReached_?), and needs commands to move in the array (the skip methods). Since some of this stuff wasn't in the original code, this was a refactoring and a feature add at the same time. No big deal though, just pointing it out.

I'll show the remainder of the Lexer code in the next section.

Syntax Error Handling

We're a little bit closer to getting syntax errors with line numbers in them. I didn't show the SyntaxErrorHandler trait in the last post, but it was trivial. It just took a SyntaxError Token (which I've decided was stupid because SyntaxErrors are not tokens), and it just printed them out. Here's both now:

Old code:


trait SyntaxErrorHander{
  def syntaxError(error:SyntaxError):Unit = {
    println(error)
  }
}

New code:


trait SyntaxErrorHander{
  def syntaxError(unit: CompilationUnit, 
                  lineNumber: Int, offset: Int):Unit = {
    println("Syntax Error(" + unit + ", line number:" + 
               lineNumber + ", offset:" + offset + ")")
  }
}

Not much different, but now the handler takes the CompilationUnit, and the line number and offset. Those are all we really need to handle the requirement. And I do it like this:


trait AddToCompilationUnitSyntaxErrorHander extends SyntaxErrorHander{
  override def syntaxError(unit: CompilationUnit, 
                           lineNumber: Int, offset: Int):Unit = {
    super.syntaxError(unit, lineNumber, offset)
    unit.syntaxError(lineNumber, offset)
  }
}

The code for calling into the syntax error handler in the Lexer actually shaped up nicely as well:


trait Lexer extends CompilationUnitWalker with SyntaxErrorHander {
  var unit: CompilationUnit

  def finders: List[FunctionalCharArray => Option[Lexeme]]

  def nextToken: Token = {

    // if we've already lexed the whole file, get the f out.
    if (eofReached_?) return EOF(currentLineNumber)

    // find all possible Lexemes
    val matches = 
      finders.map(f => f(currentLine)).filter(_.isDefined).map(_.get)

    // if we've found no lexemes, syntax error! adjust and try again.
    if (matches.isEmpty) {
      syntaxError(unit, currentLineNumber, currentOffset)
      skipChar
      return nextToken
    }

    // the longest match found should be on top...i think...
    val longestMatch = matches.sort(_.size >= _.size).head

    // deal with the best lexeme
    handleLexeme(longestMatch)
  }

  def handleLexeme(lex: Lexeme) = lex match {
    case NewLine => {
      skipLine
      nextToken
    }
    case WhiteSpace(_) => {
      skipChar
      nextToken
    }
    case lex => {
      val indexOfLexeme = currentOffset
      skipChars(lex.data.length)
      Token(lex, currentLineNumber, indexOfLexeme)
    }
  }
}

In the middle of the nextToken we check to see if matches is empty. If it is, we haven't found any Lexeme's and we call the syntaxError method with the current CompilationUnit (which we hold), and the currentLineNumber, currentOffset which we get from the walker. Done!

Also notice the handleLexeme method at the end, because this completes the first requirement. When it creates a new Token, it passes the currentLineNumber as well. I'm still debating putting the CompilationUnit in the Token as well. Certainly a parser will know which unit its working on, but it might be helpful. I guess I'll wait til I start writing my parser to find out.

Refactoring Towards Reuse

Refactoring Towards Reuse was by far the most broad and complicated requirement/idea in the set - I thought. But as it turns out, it only took a few minutes total. All I had to do was move a few files, change a few names, honestly almost nothing.

First, I created a package called scala under lex, created a Scala file called ScalaLexer.scala, and started pulling in anything that looked Scala specific. I ended up with this:


trait ScalaLexer extends Lexer with ScalaFinders with 
  AddToCompilationUnitSyntaxErrorHander

trait ScalaFinders extends
    ScalaCharFinders with NumberFinder with ScalaIdentifinder with
    WhiteSpaceFinder with CaseClassFinder with CaseObjectFinder

case object CaseClass  extends SimpleLexeme("case class")
case object CaseObject extends SimpleLexeme("case object")

object ScalaIOLBuiler extends IdentifierOnlyLexerBuilder{
  def apply(cu: CompilationUnit) = {
    new ScalaIOL{ var unit: CompilationUnit = cu }
  }
}
trait ScalaIOL extends IdentifierOnlyLexer with ScalaIdentifinder

case class ScalaTwoPartIdentifinder(
  w1: String, w2: String, l: Lexeme) extends
  TwoPartIdentifinder(ScalaIOLBuiler, w1,w2,l)

trait CaseClassFinder extends LexemeFinder {
  override def finders =
    super.finders ::: 
    List(ScalaTwoPartIdentifinder("case", "class", CaseClass).find _)
}

trait CaseObjectFinder extends LexemeFinder {
  override def finders =
    super.finders ::: 
    List(ScalaTwoPartIdentifinder("case", "object", CaseObject).find _)
}

trait ScalaCharFinders extends LexemeFinder {
  override def finders = super.finders ::: CharFinders(
      Underscore, Comma, Dot, Eq, Semi, Colon, LeftParen,
      RightParen, LeftCurly, RightCurly, LeftBrace, RightBrace
  )
}

I also moved Identifinder into the Scala package as well, since it's Scala specific. All in all, I didn't have to do much. It's only about 30 lines of code and a lot of it can probably be cleaned up and reduced further.

Now, if I wanted to lex Java I'd probably have to write a similar 30 lines of code, plus a JavaIdentifinder. I should do it, because then I'd find further areas for factoring out common code.

So I guess in my next installment I'll do just that, and, I'll show the testing at that point too. The testing is coming along really nicely. At some point I plan to contrast this code with the real Scala lexer code, and the tests for it (I still have to find those).

For now I'll give a quick snippit of the testing I did for handling multiple lines, but I'll wait to explain it:


trait NewLineTests extends LexerTest{

  test("new lines") {
    checkCode("x,\ny",
      (0,0) -> Identifier("x"),
      (0,1) -> Comma,
      (1,0) -> Identifier("y"))
  }

  test("more new lines"){
    checkCode("""x,
    y,
      z""",
      (0,0) -> Identifier("x"),
      (0,1) -> Comma,
      (1,4) -> Identifier("y"),
      (1,5) -> Comma,
      (2,6) -> Identifier("z"))
  } 
}

That's all for now. It's late.

Jack Cough on Software

Thursday, May 14, 2009

Refactoring in Scala

Multi Line Support

Syntax Error Handling

Refactoring Towards Reuse

No comments:

Post a Comment