This is a really long post, but IM(not so)HO, it's really fun!
Sunday I decided, "I'm going to write my own Lexer for Scala", simply because I felt like it. I'm only about 8 hours in, but I've got a lot of functionality.
And it's only about 300 lines of code so far. Now, those 300 lines don't cover nearly all of Scala. It's lacking XML handling and comments, and has limited support for numbers, String literals, Unicode characters, and error handling. But it does have a lot of nice features, including the ability to understand most (maybe all) identifiers and operators.
There are other really nice things about it as well. First, it's very functional - it uses HOFs to recognize tokens, and it's mostly immutable, with its mutable state isolated to one very small area. Second, the tests for it are simple and elegant. Third, it's quite reusable, as I've managed to leverage several components from within itself.
Before I get to the implementation, let me show how to use it. It can be used very simply at the command line. Example:
scala> import com.joshcough.compiler.lex._
import com.joshcough.compiler.lex._
scala> val code = "x"
code: java.lang.String = x
scala> val lexer = new SimpleLexer(code) with Identifinder
lexer: com.joshcough.compiler.lex.SimpleLexer with com.joshcough.compiler.lex.Identifinder = SimpleLexer(FunctionalCharArray([C@b6aea4))
scala> lexer.nextToken
res0: com.joshcough.compiler.lex.Token = Token(Identifier(x),0)
scala> lexer.nextToken
res1: com.joshcough.compiler.lex.Token = EOF(1)
In this code I create a Lexer only capable of recognizing identifiers, by mixing in the trait Identifinder (and yes, that trait name is awesome). The "code" that I use here is simply the String x. When I ask the lexer for its tokens, it gives me back exactly what I'd expect: Token(Identifier(x),0), and EOF(1). This means it found the identifier x at offset 0, and found EOF at offset 1. Simple, but obviously not yet very useful. Luckily, the identifier recognition is much more powerful. Let's see some more examples (from now on, where appropriate, I'll remove redundant type info from the output or replace it with "...", and simply indent the output):
scala> lexer lexing "-->"
scala> lexer.nextToken
Token(Identifier(-->),0)
scala> lexer.nextToken
EOF(3)
scala> lexer lexing "_ewrk_1212_adf435445_^^^"
scala> lexer.nextToken
Token(Identifier(_ewrk_1212_adf435445_^^^),0)
Much better. But what happens if I pass in something that the lexer shouldn't understand?
scala> lexer lexing "1"
scala> lexer.nextToken
SyntaxError(0)
res4: com.joshcough.compiler.lex.Token = EOF(1)
Not too terrible. "1" isn't a valid identifier, and since this lexer only recognizes identifiers, it spits out "SyntaxError(0)" because it did not recognize the input starting at position 0. Now, this is actual System.out output; that's all you get by default. The actual Scala compiler simply adds an error to the compilation unit with the offset. There really isn't much a lexer can do with errors. Later, I'll show how we can cache them off as well.
The last lexer failed to recognize numbers. That's easy enough to fix. Simply mix in NumberFinder:
scala> val lexer = new SimpleLexer(code) with
Identifinder with NumberFinder
scala> lexer lexing "123"
scala> lexer.nextToken
Token(Number(123),0)
scala> lexer lexing "-->"
scala> lexer.nextToken
Token(Identifier(-->),0)
Now we can recognize identifiers and numbers. Next up we'll start recognizing all of these single characters: [ ] { } ( ) , . : _ = ; This is just as easy as the others - just mix in CharFinders:
scala> lexer lexing "(x,y);"
scala> lexer.nextToken
Token(LeftParen,0)
scala> lexer.nextToken
Token(Identifier(x),1)
scala> lexer.nextToken
Token(Comma,2)
scala> lexer.nextToken
Token(Identifier(y),3)
scala> lexer.nextToken
Token(RightParen,4)
scala> lexer.nextToken
Token(Semi,5)
Notice that we haven't seen any white space yet. In fact, the lexer here isn't yet capable of handling white space. To do that - you guessed it - mix in WhiteSpaceFinder.
scala> val lexer = new SimpleLexer(code) with Identifinder with
NumberFinder with CharFinders with WhiteSpaceFinder
scala> lexer lexing "( x , y );"
scala> lexer.nextToken
Token(LeftParen,0)
scala> lexer.nextToken
Token(Identifier(x),2)
scala> lexer.nextToken
Token(Comma,4)
scala> lexer.nextToken
Token(Identifier(y),6)
scala> lexer.nextToken
Token(RightParen,8)
scala> lexer.nextToken
Token(Semi,9)
Notice that like any good lexer, this lexer simply skips over the white space, handing you the next actual token at its right location.
There are a few more, such as RocketFinder (finds =>), CaseClassFinder (finds "case class" and returns it as one Lexeme instead of two separate identifiers), and CaseObjectFinder (which does exactly the same thing for "case object"). But I'm not going to show examples of their usage; I want to get to the implementation.
Lexer Implementation
Earlier I claimed that the Lexer was "mostly immutable". Of course, it has to be somewhat mutable to return something different to the parser on successive calls to nextToken. To do this, the Lexer uses a FunctionalCharArray, which is just a very thin wrapper around Array[Char].
case class FunctionalCharArray(chars: Array[Char]) {
  def skip(i: Int) = new FunctionalCharArray(chars.drop(i))
  def size = chars.size
  def nextChar: Option[Char] = get(0)
  def get(index: Int): Option[Char] = {
    if (chars.size > index) Some(chars(index)) else None
  }
}
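To get a feel for the wrapper, here's what those methods return on a tiny input (just an illustration I'm adding here, not from the codebase):

val arr = FunctionalCharArray("case".toCharArray)
arr.nextChar          // Some('c')
arr.skip(2).nextChar  // Some('s') - skip returns a new array
arr.skip(4).nextChar  // None - we've run off the end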
Taking a brief peek into the Lexer, we find that it's the FunctionalCharArray that is mutated:
trait Lexer extends SyntaxErrorHandler {
  var chars: FunctionalCharArray
  var currentIndex = 0
  def finders: List[FunctionalCharArray => Option[Lexeme]]
  ...
There are a few other interesting bits here. The lexer also mutates its currentIndex, and it puts the currentIndex into the Tokens when Lexemes are found. We'll see how that works just a bit later on. Most important, though, is the fact that almost everything else runs through the Lexer's finders.
The finders are actually quite simple. Given a FunctionalCharArray, the finders return an Option[Lexeme]. They return Some(Lexeme...) if they found what they expected at the very beginning of the array, and None otherwise. The remainder of the array is simply ignored by the finder. A few simple examples should help.
- The Identifinder returns Some(Identifier(x)) when it's given an array containing 'x' as its only element, and None if the first character in the array is a character that can't legally start an identifier.
- The NumberFinder returns Some(Number(123)) when given this array: ['1', '2', '3', ' ', '+', ' ', '7']
- The CharFinders only ever look at the first element in the array.
To fully demonstrate just how simple they are, we need to see some. Have at you!
trait WhiteSpaceFinder extends LexemeFinder {
  override def finders = super.finders ::: List(findWhiteSpace _)
  def findWhiteSpace(chars: FunctionalCharArray): Option[Lexeme] = {
    chars.nextChar match {
      case Some(' ') | Some('\t') | Some('\n') => Some(WhiteSpaceLex)
      case _ => None
    }
  }
}
The first thing to notice about this finder is that it immediately adds a finder method to its super's list of finders. This is what enables us to keep mixing in finder after finder, as at the beginning of this post.
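For that stacking to work, the base LexemeFinder trait only needs to supply an empty list for the mixins to append to. I'm not showing the real trait here, but presumably it looks something like this sketch (an assumption on my part, not the actual source):

trait LexemeFinder {
  // Each mixin overrides this and appends its own function,
  // so stacking traits builds up the full list of finders.
  def finders: List[FunctionalCharArray => Option[Lexeme]] = Nil
}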
The finder that WhiteSpaceFinder adds is its own findWhiteSpace method. As I explained, it takes a FunctionalCharArray and returns an Option[Lexeme]. In this case, it peeks at the first character in the array; if that char is a space, tab, or newline, it returns a Some with the Lexeme, and if it's not, it returns None. Simple.
That one is pretty trivial though; it only needs to look at one character. Let's take a look at one that's more involved. Here is a version of CaseClassFinder. It's not the exact implementation; I've changed it slightly to help demonstrate.
class IdentifierOnlyLexer(var chars: FunctionalCharArray)
  extends Lexer with WhiteSpaceFinder with Identifinder

trait CaseClassFinder extends LexemeFinder {
  override def finders = super.finders ::: List(findCaseClass _)
  def findCaseClass(chars: FunctionalCharArray) = {
    if (chars.size < 10) None
    else {
      val lexer = new IdentifierOnlyLexer(chars)
      (lexer.nextToken, lexer.nextToken) match {
        case (Token(Identifier("case"), _),
              Token(Identifier("class"), _)) => Some(CaseClass)
        case _ => None
      }
    }
  }
}
I love this example because it's small but not trivial, and because it reuses components in the lex package. Before I explain, let's think about what a CaseClassFinder should do. If it sees "case", followed by "class", at the beginning of the array, then it should return the Lexeme CaseClass. Otherwise, it should return None. But it's more important to think about that in a slightly different way:
If a CaseClassFinder finds
the Identifier "case" followed by
the Identifier "class", then it should return CaseClass.
Well, we already know how to find identifiers... it was the first thing I showed in this post! As it turns out, that's exactly how CaseClassFinder does it as well. It creates a new Lexer with its input, an IdentifierOnlyLexer (technically, it could fire up a Lexer with all the finders, but that's overkill). It then asks that Lexer for its first two tokens, and if those Tokens contain the Lexemes Identifier("case") and Identifier("class"), then it knows it's found a case class. BAM!
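For comparison, a two-character finder like the RocketFinder I mentioned earlier doesn't need a nested lexer at all. Here's a plausible sketch, by analogy with the other finders (not the actual source; the Rocket lexeme name is my assumption):

trait RocketFinder extends LexemeFinder {
  override def finders = super.finders ::: List(findRocket _)
  // Peek at the first two characters; succeed only on "=>".
  def findRocket(chars: FunctionalCharArray): Option[Lexeme] =
    (chars.get(0), chars.get(1)) match {
      case (Some('='), Some('>')) => Some(Rocket)
      case _ => None
    }
}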
The Identifinder and NumberFinder traits are too long to show here. I'll post a link to the codebase.
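That said, a deliberately stripped-down NumberFinder in the same style might look something like this (digits only, and the Number lexeme's constructor is assumed; the real trait handles much more):

trait SimpleNumberFinder extends LexemeFinder {
  override def finders = super.finders ::: List(findNumber _)
  def findNumber(chars: FunctionalCharArray): Option[Lexeme] = {
    // Walk forward while we keep seeing digits.
    def digitsEnd(i: Int): Int = chars.get(i) match {
      case Some(c) if c.isDigit => digitsEnd(i + 1)
      case _ => i
    }
    val end = digitsEnd(0)
    if (end == 0) None
    else Some(Number(new String(chars.chars.take(end))))
  }
}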
Now, we have a few things to do to wrap up. I still need to show how the Lexer itself actually uses the finders. And the astute reader might have already realized that two finders might both return a Lexeme. For example, given the array [':',':'], the Colon finder would return Colon, and the Identifinder would return Identifier(::). The Lexer handles this case easily: the real value must be the longer of the two. Simple as that. Now let's take a look at all of Lexer.
Lexer Implementation (for real this time!)
 1 trait Lexer extends SyntaxErrorHandler {
 2
 3   var chars: FunctionalCharArray
 4   var currentIndex = 0
 5
 6   def finders: List[FunctionalCharArray => Option[Lexeme]]
 7
 8   def nextToken: Token = {
 9
10     def skip(i: Int) { currentIndex += i; chars = chars.skip(i) }
11
12     val token = nextToken_including_whitespace_and_syntaxerror
13
14     token match {
15       case SyntaxError(i) => {
16         syntaxError(SyntaxError(i))
17         skip(1)
18         nextToken
19       }
20       case EOF(_) => token
21       case Token(WhiteSpaceLex, _) => skip(1); nextToken
22       case Token(lex, _) => skip(lex.data.length); token
23     }
24   }
25
26   private def nextToken_including_whitespace_and_syntaxerror: Token = {
27     if (chars.nextChar == None) EOF(currentIndex)
28     else {
29       val lexemesFound: List[Lexeme] = {
30         finders.map(f => f(chars)).filter(_ match {
31           case Some(t) => true
32           case _ => false
33         }).map(_.get)
34       }
35
36       if (lexemesFound.size == 0) return SyntaxError(currentIndex)
37
38       val lexemesSorted =
39         lexemesFound.sort(_.data.size >= _.data.size)
40
41       Token(lexemesSorted(0), currentIndex)
42     }
43   }
44
45   def lexing(s: String): Lexer = {
46     lexing(new FunctionalCharArray(s.toCharArray))
47   }
48
49   def lexing(cs: FunctionalCharArray): Lexer = {
50     chars = cs; currentIndex = 0; this
51   }
52 }
First, look at nextToken_including_whitespace_and_syntaxerror, which starts on line 26. This method returns the best Token possible (like :: vs :), but more importantly, it always returns a Token; it returns WhiteSpace and SyntaxError tokens as well. The calling method, nextToken, is the guy in charge of filtering those out, resetting, and continuing. But we'll get to that in just a second. For now, let's enumerate the nextToken_including_whitespace_and_syntaxerror method's steps.
- The first thing it does is check if there are any characters left. If there are none, it simply returns EOF. Any more calls to nextToken will continue to return EOF until the Lexer gets new data (via the lexing methods).
- It then calls all of its finders with the current array: finders.map(f => f(chars))
- It then immediately filters the results, because it's only interested in finders that actually returned Some(Lexeme...), and not None. Of course.
- At that point (line 36) it checks that someone actually found a Lexeme (if (lexemesFound.size == 0) return SyntaxError(currentIndex)). If no finders found anything, then we must have some unrecognized character in the input. Syntax error!
- On line 38 it sorts the Lexemes by size, aiming to get the largest (there's a little demo of this just below).
- Finally, it returns the largest token, ignoring any other matches.
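To make the longest-match rule concrete, here's the sort from line 39 in isolation, with toy data (I'm assuming Colon's data is ":", and an Identifier's data is its text):

val found: List[Lexeme] = List(Colon, Identifier("::"))
// Sort longest-first, exactly as the Lexer does on line 39:
val best = found.sort(_.data.size >= _.data.size).head
// best is Identifier("::") - the one-character Colon loses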
Now I'll explain the nextToken method, and then wrap up.
The first action nextToken takes is to call nextToken_including_whitespace_and_syntaxerror, on line 12. With that token, it must make a decision.
- If it receives a SyntaxError, it notifies the syntax error handler, and then, in an attempt to reset/restart, moves ahead one space in the input and recurs. (I'll show a sketch of an error-caching handler right after this list.)
- If it receives an EOF, just hand it over to the Parser, it should know what to do with it.
- If it receives WhiteSpace, then it also moves ahead one space and recurs. There might be a better strategy here (and/or for syntax errors), but this one is simple and works OK.
- Finally, if it gets a legitimate token, then move the entire array ahead by the length of the Lexeme that was found, and return the token. The next time nextToken is called, it should start at the new location, one character beyond the end of this token.
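Earlier I promised to show how we could cache errors off. Here's a sketch of that idea. All the Lexer needs from SyntaxErrorHandler is the syntaxError method used on line 16 of the listing; the trait name of this cache and its buffer are my own invention:

import scala.collection.mutable.ListBuffer

// Stack this on top of a Lexer to remember every error seen,
// instead of (or in addition to) printing to System.out.
trait CachingSyntaxErrorHandler extends SyntaxErrorHandler {
  private val cached = new ListBuffer[SyntaxError]
  override def syntaxError(e: SyntaxError) { cached += e }
  def errors: List[SyntaxError] = cached.toList
}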
Wrapping up: the way I've implemented this, I don't think it will be that difficult at all to add in the missing features. XML will probably be a bear, just because. I guess I don't have String literal support at all, so I'll add that next, and Char literals too. They should be mostly straightforward, as the sketch below suggests.
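For instance, a first cut at a StringLiteralFinder could follow the same pattern as the other finders. This is just a sketch (the StringLit lexeme is assumed, and escapes and unterminated strings aren't handled):

trait StringLiteralFinder extends LexemeFinder {
  override def finders = super.finders ::: List(findStringLit _)
  def findStringLit(chars: FunctionalCharArray): Option[Lexeme] =
    chars.nextChar match {
      case Some('"') =>
        // Scan for the closing quote, starting after the opener.
        def closing(i: Int): Option[Int] = chars.get(i) match {
          case Some('"') => Some(i)
          case Some(_)   => closing(i + 1)
          case None      => None
        }
        // Keep both quotes in data, so the Lexer skips past them.
        closing(1).map { end =>
          StringLit(new String(chars.chars.take(end + 1)))
        }
      case _ => None
    }
}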
If you're actually still reading, awesome! Thanks! I hope you've learned something. You can find all the actual source code here. Not bad for about 8 hours, huh?
Oh, BTW, my only two references for this were the Dragon Book (2nd ed.) and the Scala compiler itself.
Bye!