Sunday, May 31, 2009

Good Use Case for Named Arguments

 
I found what I consider a really great use case for named arguments. I'm sure there are tons of them; I just want to list this one.

Here is some code from my CPU Simulator that I was using to test a FlipFlopChain, which is essentially just an N-bit memory (maybe I'll rename it that).

val data = AllGeneratorNumber(8)
val chain = FlipFlopChain(data, Generator.on)
...

Notice that the second parameter to FlipFlopChain is Generator.on. What is this parameter though? We have to navigate into FlipFlopChain just to find out that it represents the write bit. If we didn't have the source, we might have no idea.

So I changed the code to make it more explicit, by declaring a writeBit val.

val data = AllGeneratorNumber(8)
val writeBit = Generator.on
val chain = FlipFlopChain(data, writeBit)
...

Note that I never use the writeBit again in the test code; it's just there to help the reader understand what that second parameter is.

With keyword arguments, there is a much better solution:

val data = AllGeneratorNumber(8)
val chain = FlipFlopChain(data, writeBit=Generator.on)
...

Win!
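To make this concrete outside the simulator, here's a minimal, self-contained sketch. This FlipFlopChain is a toy stand-in of my own, not the real class from the project:

```scala
// Hypothetical stand-in for the real FlipFlopChain, just to show the call site.
case class FlipFlopChain(data: Int, writeBit: Boolean)

object NamedArgsDemo {
  def main(args: Array[String]): Unit = {
    // the name at the call site tells the reader what the second argument means
    val chain = FlipFlopChain(8, writeBit = true)
    println(chain) // FlipFlopChain(8,true)
  }
}
```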

Saturday, May 30, 2009

Some Simple Scala Refactorings

 
I constantly go back and tinker with my old Scala code, especially in my CPU Simulator project. When I'm in the old code I notice something that I haven't really noticed before - the code is still good. In a previous life when I wrote Java code, I'd go back and look at old code and want to throw up. Scala is so concise that this just doesn't happen. Yes, I do find room for improvement here and there, but overall I'm still really happy with the code (there was an exception, when I didn't know anything about functional programming, and refactored from imperative to functional...but that doesn't count).

Anyway, I have a few refactorings that I wanted to mention.

Refactoring to Case Classes


I refactored all of my LogicGate classes to be case classes instead of regular classes. The reason? The code is just prettier. I'm not a huge fan of the new keyword in general, and I especially don't like it littering up my code.

Old Code:

class XorGate(inputA: PowerSource, inputB: PowerSource)
    extends LogicGate {
  val output: PowerSource =
    new AndGate(
      new OrGate(inputA, inputB),
      new NandGate(inputA, inputB))
}

New Code:

case class XorGate(inputA: PowerSource, inputB: PowerSource)
    extends LogicGate {
  val output: PowerSource =
    AndGate(OrGate(inputA, inputB), NandGate(inputA, inputB))
}

The difference is small, but I literally had hundreds of new calls littered throughout my code. Now, I could also mention the extra power that case classes give as well - nice toString, pattern matching, hash code and equals... I happened to not really be using most of those things in the CPU simulator, so I don't have an immediate example. Oh well. In my opinion/experience, favor case classes over regular classes.
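Since I don't have an immediate example from the simulator, here's a self-contained sketch of those freebies, with toy Wire/AndGate types of my own rather than the project's real ones:

```scala
// Toy types, not the real CPU-simulator classes.
case class Wire(name: String)
case class AndGate(inputA: Wire, inputB: Wire)

object CaseClassExtras {
  def main(args: Array[String]): Unit = {
    val g = AndGate(Wire("a"), Wire("b"))
    println(g)                                  // free toString: AndGate(Wire(a),Wire(b))
    println(g == AndGate(Wire("a"), Wire("b"))) // free structural equals: true
    val name = g match {                        // free pattern matching
      case AndGate(Wire(a), _) => a
    }
    println(name)                               // a
  }
}
```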

This does violate encapsulation somewhat, though. Previously, inputA and inputB weren't accessible from outside the class. I can get around that easily enough by adding private to my vals:

case class XorGate(private val inputA: PowerSource,
                   private val inputB: PowerSource)
    extends LogicGate {
  val output: PowerSource =
    AndGate(OrGate(inputA, inputB), NandGate(inputA, inputB))
}

Now I'm not exposing those fields, I still have all the power mentioned above, and I don't have the pesky "new" statements hanging around.

Refactoring to Fewer Files


I noticed something a little unsettling. I had separate files: LogicGate.scala, AndGate.scala, OrGate.scala, NandGate.scala, NorGate.scala, XorGate.scala. One for each type of gate. Several files, and they were all very, very small, less than 10 lines each, all with a common package statement, and similar imports.

So I tried something - putting them all into one file. This is something I normally do by default now. I'm not sure when I started putting lots of classes into one file and not spreading them out...Anyway, in my opinion the result was a lot better. Instead of 6 files roughly 5-10 lines long, I have one file less than 40 lines long. I got rid of the redundant package and import statements. But most importantly, now I can see everything there is to know about my logic gates on the screen at one time.


package com.joshcough.cpu.gates

import electric.{Relay, Inverter, Wire, PowerSource}

object LogicGate {
  implicit def logicGateToPowerSource(lg: LogicGate): PowerSource = lg.output
}

trait LogicGate {
  val inputA: PowerSource
  val inputB: PowerSource
  val output: PowerSource
}

case class AndGate(val inputA: PowerSource,
                   val inputB: PowerSource) extends LogicGate {
  val output: PowerSource = Relay(inputA, Relay(inputB))
}

case class NandGate(val inputA: PowerSource,
                    val inputB: PowerSource) extends LogicGate {
  val output = new Wire
  Inverter(inputA) --> output
  Inverter(inputB) --> output
}

case class NorGate(val inputA: PowerSource,
                   val inputB: PowerSource) extends LogicGate {
  val output: PowerSource = Inverter(inputB, Inverter(inputA))
}

case class OrGate(val inputA: PowerSource,
                  val inputB: PowerSource) extends LogicGate {
  val output = new Wire
  inputA --> output
  inputB --> output
}

case class XorGate(val inputA: PowerSource,
                   val inputB: PowerSource) extends LogicGate {
  val output: PowerSource =
    AndGate(OrGate(inputA, inputB), NandGate(inputA, inputB))
}


I know some people who have said they will never do this, never having more than one class in a file. I think that's wrong. When you have several small classes, seeing everything at once overrules most arguments.
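One nice detail in the listing above is the implicit logicGateToPowerSource conversion in the LogicGate companion. Here's a self-contained sketch of what that buys you, with toy types of my own standing in for the real electric package:

```scala
import scala.language.implicitConversions

// Toy stand-ins for the real electric package types.
trait PowerSource { def on: Boolean }
case class Battery(on: Boolean) extends PowerSource

trait LogicGate { val output: PowerSource }
object LogicGate {
  // same trick as the companion above: a gate can be used wherever
  // a PowerSource is expected, by standing in for its output
  implicit def logicGateToPowerSource(lg: LogicGate): PowerSource = lg.output
}

case class NotGate(input: PowerSource) extends LogicGate {
  val output: PowerSource = Battery(!input.on)
}

object ImplicitDemo {
  def describe(p: PowerSource): String = if (p.on) "on" else "off"

  def main(args: Array[String]): Unit = {
    val gate = NotGate(Battery(false))
    // the gate is implicitly converted to gate.output
    println(describe(gate)) // on
  }
}
```

This is why XorGate's output can be built directly from other gates: each nested gate is converted to its output PowerSource on the way in.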

More


I probably should suck it up and turn my logic gates into actual functions at some point in the near future. That should provide some really interesting material on refactoring. Until then, cya.

Simple Scala Keyword Parameters

Everything here is mostly intuitive, but I'm putting it here for personal reference. It should be helpful to a lot of people, though. There are probably a whole ton of interesting cases I'm leaving out, and I'll try to keep this updated when I think of them. For now I'm covering Default Values, No Default Values, and Overloading. (Note: this doesn't come out until Scala 2.8; I'm working off trunk.)

Default Values


  • Define a class that takes two keyword args, name and lives, and provide default values.

    scala> case class Cat(name:String="kitty", lives:Int=9)
    defined class Cat

  • Instantiate the cat providing no arguments.

    scala> new Cat
    res1: Cat = Cat(kitty,9)

  • Instantiate a cat providing both params, unnamed.

    scala> Cat("Java", 1)
    res2: Cat = Cat(Java,1)

  • Instantiate a cat providing just the first param, unnamed.

    scala> Cat("Scala")
    res3: Cat = Cat(Scala,9)

  • Instantiate a cat providing both params, named.

    scala> new Cat(name="newspeak", lives=20)
    res4: Cat = Cat(newspeak,20)

  • Instantiate a cat providing both params in reverse order. (This only works if the argument names are given.)

    scala> new Cat(lives=20, name="newspeak")
    res5: Cat = Cat(newspeak,20)

  • Instantiate a cat providing the first param, named.

    scala> new Cat(name="newspeak")
    res6: Cat = Cat(newspeak,9)

  • Instantiate a cat providing the second param, named.

    scala> Cat(lives=4)
    res7: Cat = Cat(kitty,4)

  • Instantiate a cat providing the first argument unnamed, and the second argument named!

    scala> new Cat("Lua", lives=1)
    res8: Cat = Cat(Lua,1)

  • Attempt to instantiate a cat providing the first argument named, and the second argument unnamed. You can't do it! After you name a parameter, the parameters that follow must be named.

    scala> new Cat(name="Lua", 1)
    :7: error: positional after named argument.
    new Cat(name="Lua", 1)
    ^

No Default Values


  • Redefine class Cat without supplying default values. This means values must be provided when instantiating a cat (more generically, calling a function).

    scala> case class Cat(name:String, lives:Int)
    defined class Cat

  • Instantiate a cat providing both params, unnamed.

    scala> Cat("Java", 1)
    res9: Cat = Cat(Java,1)

  • Instantiate a cat providing both params, named.

    scala> new Cat(name="Douglas", lives=42)
    res10: Cat = Cat(Douglas,42)

  • Attempt to instantiate a cat, not providing values for the arguments.

    :7: error: not enough arguments for constructor Cat:
    (name: String,lives: Int)Cat, unspecified parameters:
    value name, value lives
    new Cat
    ^

Overloading


  • Redefine the class, overloading the constructor.

    scala> case class Cat(name:String="kitty", lives:Int=9){
    | def this(name:String) = this(name, -54)
    | }

  • Instantiate a cat providing just the first argument, unnamed. Since the compiler finds a method with the exact signature (in this case, one String), it calls it.

    scala> new Cat("Martin")
    res11: Cat = Cat(Martin,-54)

  • Here are a couple of other similar and interesting cases. First, overload the constructor, giving the argument a different name than the one defined in the primary constructor.

    scala> case class Cat(name:String="kitty", lives:Int=9){
    | def this(x:Int) = this("hello", x+99)
    | }
    defined class Cat

  • Instantiate a cat using the keyword 'lives'. Since the overloaded constructor names its argument x, and x != lives, the primary constructor (which does use 'lives') is the one that gets invoked.

    scala> new Cat(lives=8)
    res12: Cat = Cat(kitty,8)

  • Instantiate a cat - same as the String overloading case above.

    scala> new Cat(8)
    res13: Cat = Cat(hello,107)

Tuesday, May 19, 2009

Teaching Functional Programming To Kids

I started teaching some of the basic concepts of functional programming to my 8-year-old son yesterday, and wanted to write a little about it. The wonderful thing about it is that kids really are ready to learn the concepts at a very young age. I'm not actually teaching him programming, just concepts, but when the time comes, he'll basically already know how to program.

I have what I think is an absolutely perfect example. It's one that all parents and kids can identify with: The Dr. Seuss Star Belly Sneetch machine.

[Image: the Star-On machine from Dr. Seuss's "The Sneetches"]
This is a simple machine that takes in a Sneetch without a star on its belly, and spits out a Sneetch with a star on its belly. I'm just going by memory here to say that I think kids can probably understand this concept at the age of 3. In this post I'm going to use this style to represent machines:

              -----------
              |         |
Sneetch --->  |    *    |  ---> Star Bellied Sneetch
              |         |
              -----------

This is the same as the actual picture above, but works for all cases since I don't have pictures for all the concepts I want to represent.

In the Dr. Seuss book there is also the opposite machine that removes the star from a star bellied Sneetch.

               -----------
Star Bellied   |         |
Sneetch --->   |   -*    |  ---> Sneetch
               |         |
               -----------

While that seems really simple, it's all we need to start teaching a wide range of concepts to kids. I started with this one, because of its similarity with the machines above (for all the things below, Jacy and I worked them out on a whiteboard. But, paper is just as good):

         -----------
         |         |
10 --->  |   +5    |  ---> 15
         |         |
         -----------

Here we have a machine that adds five to whatever you put into it. Very simple, very easy for kids to understand. It helps to run a few more inputs through (0, 10, a billion) just to let them know that the box doesn't just take 10 and give 15, it works with all numbers.

After this one I followed up with another very easy one.

         -----------
         |         |
10 --->  |   -5    |  ---> 5
         |         |
         -----------

At this point he pointed out, "Well it could be a divided by two machine instead." This was unexpected, and impressive, and at some point I'll talk about it further...but not yet. It was great to feel that he was understanding it though.

Now that he was getting it, it was time to change things up just a little bit. I introduced the plus and minus machines, which take two inputs instead of one.

         -----------
 7 --->  |         |
         |    +    |  ---> 19
12 --->  |         |
         -----------

         -----------
12 --->  |         |
         |    -    |  ---> 5
 7 --->  |         |
         -----------

These presented no challenge whatsoever. In fact (I guess rather surprisingly), nothing I taught him presented any sort of challenge. Next I introduced the biggest and smallest machines (which we programmers call max and min).

         ------------
 7 --->  |          |
         | Biggest  |  ---> 12
12 --->  |          |
         ------------

         ------------
 7 --->  |          |
         | Smallest |  ---> 7
12 --->  |          |
         ------------

         ------------
10 --->  |          |
         | Smallest |  ---> 10
10 --->  |          |
         ------------

I guess he was a bit surprised when I showed him the last one. But, it only took showing him the answer once for him to fully understand.

I then added an equals machine that spits out YES! if the two numbers are equal, and NO! if they aren't (true and false, obviously). This is different because now we were no longer working with numbers as the inputs and outputs.

         -----------
 7 --->  |         |
         |    =    |  ---> NO!
12 --->  |         |
         -----------

         -----------
 7 --->  |         |
         |    =    |  ---> YES!
 7 --->  |         |
         -----------

Simple, but effective.

Now, Jacy and I have done considerable work with logic gates, and I wanted to show him how logic gates are really just like machines. I also taught him the word Function at this point, but didn't push it. Kids can relate to machines, not functions.

          ------------
ON  --->  |          |
          | AND GATE |  ---> OFF
OFF --->  |          |
          ------------

          ------------
ON  --->  |          |
          | OR GATE  |  ---> ON
OFF --->  |          |
          ------------

          ------------
ON  --->  |          |
          | AND GATE |  ---> ON
ON  --->  |          |
          ------------

While the logic gate examples seem simple, it tied two worlds that we've been working on together very nicely.

Fun


My son has a really short attention span, and all the while I'm doing this I have to think of different ways to make it fun. If it's not fun, he's just going to go play video games. Rightfully so; video games are fun. There were a few ideas I tinkered with before settling on the Dr. Seuss machine. One was a monster that stuffs stuff into his mouth and then spits out the answer. I thought that one was kind of neat. The point is, if you plan on teaching your child, think of something fun they can relate to.

Combining Machines



I could sense we needed some more fun at this point, and we'd learned enough basic machines that I thought it would be great to start combining them. After trying this, I recommend starting with all alike boxes. We did something a little more complicated and I ended up going too fast, and fell back on this:

        -----------
 7 -->  |         |              -------
        |    +    |  --> 19 -->  |     |
12 -->  |         |              |     |
        -----------              |     |
                                 |  +  |  --> 39
        -----------              |     |
10 -->  |         |              |     |
        |    +    |  --> 20 -->  |     |
10 -->  |         |              -------
        -----------

After you get this first larger machine done, it's pretty easy to add in more complicated machines. However, it might be good to wait until the next day, as Jacy was definitely getting it, but might have been getting a little fried. Here's an example though:

        -----------
 7 -->  |         |              -------
        |    +    |  --> 19 -->  |     |
12 -->  |         |              |     |
        -----------              |     |
                                 |  =  |  --> NO!
        -----------              |     |
10 -->  |         |              |     |
        |    +    |  --> 20 -->  |     |
10 -->  |         |              -------
        -----------

And obviously, change the 7 to an 8 and get a YES! Different kinds of boxes doing and spitting out different kinds of things. In essence, this is really all we do as programmers.
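The whiteboard machines above map directly onto functions, which is really the point of all this. Here's a small sketch (names are mine) of the combined machines written as plain Scala functions:

```scala
object Machines {
  // each "machine" is just a function
  val plus: (Int, Int) => Int = _ + _
  val equal: (Int, Int) => String = (a, b) => if (a == b) "YES!" else "NO!"

  def main(args: Array[String]): Unit = {
    // the first combined machine: two + machines feeding a third +
    println(plus(plus(7, 12), plus(10, 10)))  // 39
    // the second: two + machines feeding an = machine
    println(equal(plus(7, 12), plus(10, 10))) // NO!
    println(equal(plus(8, 12), plus(10, 10))) // YES!
  }
}
```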

Types


Being a lover of static type systems, I also talked to him about types, by saying "kind(s) of things". For example, I asked him, "What kind of thing does this machine take in (or what kind of thing do you put into this machine)?"

         -----------
         |         |
10 --->  |   +5    |  ---> 15
         |         |
         -----------

Answer: Number. I avoided Integer for now. What kind of thing does it spit out? Answer: Number.

I then showed him this next example, which should arguably have a section of its own:

        -----------
        |         |
D --->  |   +5    |  ---> I
        |         |
        -----------

This machine looks exactly the same as the machine above, except you put letters into it, and it spits out letters. We also did months. Both are interesting because they have to loop around. I didn't have to teach him that; he just got it.

Then I introduced a formal notation for types:

         -----------
 7 --->  |         |
         |    +    |  ---> 19
12 --->  |         |
         -----------

(Number,Number) -> Number

And introduced machines that change the type (he had seen it already, but only with YES! and NO! This, I think, is a better example):

        -------------
        |           |
5 --->  | To Letter |  ---> E
        |           |
        -------------

Number -> Letter


He understands the notation and can write it if I give him slots to fill in like this:

______ -> ______

or

(______, ______) -> _______
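That arrow notation is, not coincidentally, exactly how Scala writes function types. A small sketch of my own, with Int standing in for Number and Char for Letter:

```scala
object TypeNotation {
  // Number -> Number
  val addFive: Int => Int = _ + 5
  // (Number, Number) -> Number
  val plus: (Int, Int) => Int = _ + _
  // Number -> Letter, assuming 1 maps to A, 2 to B, ...
  val toLetter: Int => Char = n => ('A' + n - 1).toChar

  def main(args: Array[String]): Unit = {
    println(addFive(10)) // 15
    println(plus(7, 12)) // 19
    println(toLetter(5)) // E
  }
}
```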


Things I don't know how to teach, Yet


I certainly didn't try to teach him anything about machines that take in machines and spit out machines. Also, some of my boxes were polymorphic, but I don't think I know how to explain that to him.

For now, I think Jacy and I will just do this same stuff for a while, reinforcing it. I'm not sure what the best thing to teach him next is. Some of the stuff here I've skimped on writing up, and we actually spent more time on than it seems.

Anyway, this was all really, really fun, for both of us.

Thursday, May 14, 2009

Refactoring in Scala

Another really long post, but why not!

My Lexer is now approaching 500 lines with roughly 350 lines of test code. Maybe that's still trivial, but it doesn't feel trivial. Maybe it doesn't feel trivial because it does a lot. Those 500 lines of code cover so many features. Yesterday night I started to add a bunch more features, and realized a few spots needed some heavy refactoring to achieve what I want. Here's what I wanted to add:

  • New line support. Yes, my previous lexer could only lex one line at a time. Syntax errors indicated the offset, but not the line. A parser should know what line and file the token occurred in.

  • Better handling and testing of syntax errors.

  • Separate Scala lexing from Lexer core, providing a reusable Lexer framework for Lexing any language.


I had a few other goals as well. I wanted the tests to be much more organized, mostly arranging for testing individual Finders. So far, everything was lumped together in one giant test. I wanted to do a general cleanup of my Lexer trait, and extract the mutability from it. And as usual, I wanted to make sure things were nice and clean and short and readable. I have a few goals on my plate that I didn't get to as well: comments, Strings, Chars, peeking at the next token without mutating. Some of that stuff might require the Finders to be aware of a Context of some sort. I hope not. Also, maybe I could write a preprocessor to replace any comments with white space.

Multi Line Support


Anyway, let's get to the new line support. Like I said, the main goal was to have better syntax error support, knowing the line and the offset, not just the offset. But the parser needs more information than just the offset as well. In order to do this, I decided to introduce the idea of a CompilationUnit, which holds all the code. Previously, the Lexer just worked with a single array.

Old code:

trait Lexer extends SyntaxErrorHander {
  var chars: FunctionalCharArray
  var currentIndex = 0

New code:

trait Lexer extends CompilationUnitWalker with SyntaxErrorHander {
  var unit: CompilationUnit

Notice something else important here. The old code held its chars and was responsible for handling its own mutation, and that mutation was a bit cluttered in with the lexing logic. It wasn't horrible, but not great. Now the mutation is all self-contained in CompilationUnitWalker, which manages the current line and current character pointers. I'll save the listing for that trait until the end, as it's where all the ugliness lives. But that's the good news: it's no longer in Lexer. For now though, it helps to see the interface for that trait:

trait CompilationUnitWalkerInterface {
  var unit: CompilationUnit
  def skipChar: Unit
  def skipChars(i: Int): Unit
  def skipLine: Unit
  def currentLine: FunctionalCharArray
  def currentLineNumber: Int
  def currentOffset: Int
  def eofReached_? : Boolean
}

The finders need the current line, Tokens and errors need currentLineNumber and currentOffset, and the Lexer itself needs to know if it processed the whole file (eofReached_?), and needs commands to move in the array (the skip methods). Since some of this stuff wasn't in the original code, this was a refactoring and a feature add at the same time. No big deal though, just pointing it out.

I'll show the remainder of the Lexer code in the next section.

Syntax Error Handling


We're a little bit closer to getting syntax errors with line numbers in them. I didn't show the SyntaxErrorHander trait in the last post, but it was trivial. It just took a SyntaxError Token (which I've decided was stupid, because SyntaxErrors are not tokens) and printed it out. Here's both now:

Old code:

trait SyntaxErrorHander {
  def syntaxError(error: SyntaxError): Unit = {
    println(error)
  }
}

New code:

trait SyntaxErrorHander {
  def syntaxError(unit: CompilationUnit,
                  lineNumber: Int, offset: Int): Unit = {
    println("Syntax Error(" + unit + ", line number:" +
      lineNumber + ", offset:" + offset + ")")
  }
}

Not much different, but now the handler takes the CompilationUnit, and the line number and offset. Those are all we really need to handle the requirement. And I do it like this:

trait AddToCompilationUnitSyntaxErrorHander extends SyntaxErrorHander {
  override def syntaxError(unit: CompilationUnit,
                           lineNumber: Int, offset: Int): Unit = {
    super.syntaxError(unit, lineNumber, offset)
    unit.syntaxError(lineNumber, offset)
  }
}

The code for calling into the syntax error handler in the Lexer actually shaped up nicely as well:

trait Lexer extends CompilationUnitWalker with SyntaxErrorHander {
  var unit: CompilationUnit

  def finders: List[FunctionalCharArray => Option[Lexeme]]

  def nextToken: Token = {

    // if we've already lexed the whole file, get the f out.
    if (eofReached_?) return EOF(currentLineNumber)

    // find all possible Lexemes
    val matches =
      finders.map(f => f(currentLine)).filter(_.isDefined).map(_.get)

    // if we've found no lexemes, syntax error! adjust and try again.
    if (matches.isEmpty) {
      syntaxError(unit, currentLineNumber, currentOffset)
      skipChar
      return nextToken
    }

    // the longest match found should be on top...i think...
    val longestMatch = matches.sort(_.size >= _.size).head

    // deal with the best lexeme
    handleLexeme(longestMatch)
  }

  def handleLexeme(lex: Lexeme) = lex match {
    case NewLine => {
      skipLine
      nextToken
    }
    case WhiteSpace(_) => {
      skipChar
      nextToken
    }
    case lex => {
      val indexOfLexeme = currentOffset
      skipChars(lex.data.length)
      Token(lex, currentLineNumber, indexOfLexeme)
    }
  }
}

In the middle of nextToken we check to see if matches is empty. If it is, we haven't found any Lexemes, and we call the syntaxError method with the current CompilationUnit (which we hold) and the currentLineNumber and currentOffset, which we get from the walker. Done!

Also notice the handleLexeme method at the end, because this completes the first requirement. When it creates a new Token, it passes the currentLineNumber as well. I'm still debating putting the CompilationUnit in the Token as well. Certainly a parser will know which unit it's working on, but it might be helpful. I guess I'll wait until I start writing my parser to find out.

Refactoring Towards Reuse


Refactoring Towards Reuse was by far the most broad and complicated requirement/idea in the set - I thought. But as it turns out, it only took a few minutes total. All I had to do was move a few files, change a few names, honestly almost nothing.

First, I created a package called scala under lex, created a Scala file called ScalaLexer.scala, and started pulling in anything that looked Scala specific. I ended up with this:

trait ScalaLexer extends Lexer with ScalaFinders with
  AddToCompilationUnitSyntaxErrorHander

trait ScalaFinders extends
  ScalaCharFinders with NumberFinder with ScalaIdentifinder with
  WhiteSpaceFinder with CaseClassFinder with CaseObjectFinder

case object CaseClass extends SimpleLexeme("case class")
case object CaseObject extends SimpleLexeme("case object")

object ScalaIOLBuiler extends IdentifierOnlyLexerBuilder {
  def apply(cu: CompilationUnit) = {
    new ScalaIOL { var unit: CompilationUnit = cu }
  }
}

trait ScalaIOL extends IdentifierOnlyLexer with ScalaIdentifinder

case class ScalaTwoPartIdentifinder(
  w1: String, w2: String, l: Lexeme) extends
  TwoPartIdentifinder(ScalaIOLBuiler, w1, w2, l)

trait CaseClassFinder extends LexemeFinder {
  override def finders =
    super.finders :::
      List(ScalaTwoPartIdentifinder("case", "class", CaseClass).find _)
}

trait CaseObjectFinder extends LexemeFinder {
  override def finders =
    super.finders :::
      List(ScalaTwoPartIdentifinder("case", "object", CaseObject).find _)
}

trait ScalaCharFinders extends LexemeFinder {
  override def finders = super.finders ::: CharFinders(
    Underscore, Comma, Dot, Eq, Semi, Colon, LeftParen,
    RightParen, LeftCurly, RightCurly, LeftBrace, RightBrace
  )
}

I also moved Identifinder into the Scala package as well, since it's Scala specific. All in all, I didn't have to do much. It's only about 30 lines of code and a lot of it can probably be cleaned up and reduced further.

Now, if I wanted to lex Java I'd probably have to write a similar 30 lines of code, plus a JavaIdentifinder. I should do it, because then I'd find further areas for factoring out common code.

So I guess in my next installment I'll do just that, and, I'll show the testing at that point too. The testing is coming along really nicely. At some point I plan to contrast this code with the real Scala lexer code, and the tests for it (I still have to find those).

For now I'll give a quick snippet of the testing I did for handling multiple lines, but I'll wait to explain it:

trait NewLineTests extends LexerTest {

  test("new lines") {
    checkCode("x,\ny",
      (0,0) -> Identifier("x"),
      (0,1) -> Comma,
      (1,0) -> Identifier("y"))
  }

  test("more new lines") {
    checkCode("""x,
    y,
      z""",
      (0,0) -> Identifier("x"),
      (0,1) -> Comma,
      (1,4) -> Identifier("y"),
      (1,5) -> Comma,
      (2,6) -> Identifier("z"))
  }
}

That's all for now. It's late.

Scala over Ruby - My Debate Ends

I've maintained posts about things I like better about Scala and/or Ruby, but it's time to put an end to the debate. I've been doing Ruby every day for almost 6 months now, and I've finally concluded that, for me, it just doesn't feel nearly as nice as Scala. It's not even close, to be blunt.

My main reason? Refactoring. Over the years I've become very good at refactoring; I've actually been called a refactoring machine. I have a lot of experience refactoring really, really terrible code. Yes, this sucks, I have a history of picking the wrong job. Fortunately though, I also have some experience refactoring really nice code as well. The code that I'm working on now in Ruby is fairly new, and quite good. All the Scala code that I write is pretty good (room for improvement, but pretty good).

I can refactor so easily in Scala with huge confidence and I can't do that in Ruby at all. In Ruby:

  • It takes a long time to make major refactorings.

  • I'm never fully confident in my refactorings and almost always get annoyed by runtime errors.

  • Sometimes the stack traces are all messed up and I can't figure out where my actual error is.

  • Most code you run into isn't going to have enough test coverage to help anyway.

  • For all the preachers of TDD and instant feedback - tests ultimately/inevitably take (far) longer than the compiler. So I really don't have the instant feedback I need.

  • If you skip refactorings in Ruby because they are hard, your code becomes harder and harder to refactor. This goes for statically typed languages as well, but, it's a lot harder with Ruby, especially considering the points above.

  • Ad infinitum...

In my next post I'm going to cover a bunch of major refactorings I just did to my Lexer, and how easy it was.

Tuesday, May 12, 2009

A Scala Lexer

This is a really long post, but IM(not so)HO, it's really fun!

Sunday I decided, "I'm going to write my own Lexer for Scala", simply because I felt like it. I'm only about 8 hours in, but I've got a lot of functionality. And it's only about 300 lines of code so far. Now, those 300 lines don't cover nearly all of Scala: it lacks XML handling and comments, and has limited support for numbers, String literals, Unicode characters, and error handling. But it does have a lot of nice features, including the ability to understand most (maybe all) identifiers and operators.

There are other really nice things about it as well. First, it's very functional - it uses HOFs to recognize tokens and it's mostly immutable, with its mutable state isolated to one very small area. Second, the tests for it are simple and elegant. Third, it's quite reusable, as I've managed to leverage several components from within itself.

Before I get to the implementation, let me show how to use it. It can be done very simply at the command line. Example:

scala> import com.joshcough.compiler.lex._
import com.joshcough.compiler.lex._

scala> val code = "x"
code: java.lang.String = x

scala> val lexer = new SimpleLexer(code) with Identifinder
lexer: com.joshcough.compiler.lex.SimpleLexer with com.joshcough.compiler.lex.Identifinder = SimpleLexer(FunctionalCharArray([C@b6aea4))

scala> lexer.nextToken
res0: com.joshcough.compiler.lex.Token = Token(Identifier(x),0)

scala> lexer.nextToken
res1: com.joshcough.compiler.lex.Token = EOF(1)

In this code I create a Lexer only capable of recognizing identifiers by mixing in the trait Identifinder (and yes, that trait name is awesome). The "code" that I use here is simply the String x. When I ask the lexer for its tokens, it gives me back exactly what I'd expect, Token(Identifier(x),0), and EOF(1). This means it found the identifier x at offset 0, and found EOF at index 1. Simple, but obviously not yet very useful. Luckily, the identifier recognition is much more powerful. Let's see some more examples (From now on, where appropriate, I'll remove redundant type info from the output or replace it with "...", and simply indent the output):

scala> lexer lexing "-->"

scala> lexer.nextToken
Token(Identifier(-->),0)

scala> lexer.nextToken
EOF(3)

scala> lexer lexing "_ewrk_1212_adf435445_^^^"

scala> lexer.nextToken
Token(Identifier(_ewrk_1212_adf435445_^^^),0)

Much better. But what happens if I pass in something that the lexer shouldn't understand?

scala> lexer lexing "1"

scala> lexer.nextToken
SyntaxError(0)
res4: com.joshcough.compiler.lex.Token = EOF(1)

Not too terrible. "1" isn't a valid identifier and since this lexer only recognizes identifiers, it spits out "SyntaxError(0)", because it did not recognize the input starting at position 0. Now, this is actual System.out output. That's all you get by default. The actual Scala compiler simply adds an error to the Compilation unit with the offset. There really isn't much a lexer can do with errors. Later, I'll show how we can cache them off as well.

The last lexer failed to recognize numbers. That's easy enough to fix. Simply mix in NumberFinder:

scala> val lexer = new SimpleLexer(code) with
Identifinder with NumberFinder

scala> lexer lexing "123"

scala> lexer.nextToken
Token(Number(123),0)

scala> lexer lexing "-->"

scala> lexer.nextToken
Token(Identifier(-->),0)

Now we can recognize identifiers and numbers. Next up we'll start recognizing all of these single characters: [ ] { } ( ) , . : _ = ; This is just as easy as the others - just mix in CharFinders:

scala> lexer lexing "(x,y);"

scala> lexer.nextToken
Token(LeftParen,0)

scala> lexer.nextToken
Token(Identifier(x),1)

scala> lexer.nextToken
Token(Comma,2)

scala> lexer.nextToken
Token(Identifier(y),3)

scala> lexer.nextToken
Token(RightParen,4)

scala> lexer.nextToken
Token(Semi,5)

Notice that we haven't seen any white space yet. In fact, the lexer here isn't yet capable of handling white space. To do that - you guessed it - mix in WhiteSpaceFinder.

scala> val lexer = new SimpleLexer(code) with Identifinder with
NumberFinder with CharFinders with WhiteSpaceFinder

scala> lexer lexing "( x , y );"

scala> lexer.nextToken
Token(LeftParen,0)

scala> lexer.nextToken
Token(Identifier(x),2)

scala> lexer.nextToken
Token(Comma,4)

scala> lexer.nextToken
Token(Identifier(y),6)

scala> lexer.nextToken
Token(RightParen,8)

scala> lexer.nextToken
Token(Semi,9)

Notice that like any good lexer, this lexer simply skips over the white space, handing you the next actual token at its right location.

There are a few more, such as RocketFinder (finds =>), CaseClassFinder (finds "case class" and returns it as one Lexeme instead of two separate identifiers), and CaseObjectFinder (which does exactly the same thing for "case object"). But I'm not going to show examples of their usage; I want to get to the implementation.

Lexer Implementation


Earlier I made the claim that the Lexer was "mostly immutable". Of course, it has to be somewhat mutable to return something different to the parser on successive calls to nextToken. To do this, the Lexer uses a FunctionalCharArray, which is just a very thin wrapper around Array[Char].

case class FunctionalCharArray(chars: Array[Char]) {
  def skip(i: Int) = new FunctionalCharArray(chars.drop(i))
  def size = chars.size
  def nextChar: Option[Char] = get(0)

  def get(index: Int): Option[Char] = {
    if (chars.size > index) Some(chars(index)) else None
  }
}
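Since skip calls drop on the underlying array, it hands back a brand new FunctionalCharArray rather than modifying anything in place. Here's a quick self-contained sketch of that behavior (the class from above is redeclared so the snippet runs on its own):

```scala
// FunctionalCharArray, redeclared verbatim for a self-contained example.
case class FunctionalCharArray(chars: Array[Char]) {
  def skip(i: Int) = new FunctionalCharArray(chars.drop(i))
  def size = chars.size
  def nextChar: Option[Char] = get(0)
  def get(index: Int): Option[Char] =
    if (chars.size > index) Some(chars(index)) else None
}

val original = FunctionalCharArray("abc".toCharArray)
val skipped  = original.skip(1)

assert(original.nextChar == Some('a')) // the original still starts at 'a'
assert(skipped.nextChar == Some('b'))  // the new array starts at 'b'
assert(original.size == 3 && skipped.size == 2)
```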

Taking a brief peek into the Lexer, we find that it's the FunctionalCharArray that is mutated:

trait Lexer extends SyntaxErrorHander {

  var chars: FunctionalCharArray

  var currentIndex = 0

  def finders: List[FunctionalCharArray => Option[Lexeme]]

  ...

There are a few other interesting bits here. The lexer also mutates its currentIndex, and it puts the currentIndex into the Tokens when Lexemes are found. We'll see how that works a bit later on. Most important, though, is the fact that almost everything else runs through the Lexer's finders.

The finders are actually quite simple. Given a FunctionalCharArray, the finders return an Option[Lexeme]. They return Some(Lexeme...) if they found what they expected at the very beginning of the array, and None otherwise. The remainder of the array is simply ignored by the finder. A few simple examples should help.
  • The Identifinder returns Some(Identifier(x)) when it's given an array containing 'x' as its only element, and None if the first character in the array is a character that can't legally start an identifier.
  • The NumberFinder returns Some(Number(123)) when given this array: ['1', '2', '3', ' ', '+', ' ', '7']
  • The CharFinders only ever look at the first element in the array.

To fully demonstrate just how simple they are, we need to see some. Have at you!

trait WhiteSpaceFinder extends LexemeFinder {
  override def finders = super.finders ::: List(findWhiteSpace _)

  def findWhiteSpace(chars: FunctionalCharArray): Option[Lexeme] = {
    chars.nextChar match {
      case Some(' ') | Some('\t') | Some('\n') => Some(WhiteSpaceLex)
      case _ => None
    }
  }
}

The first thing to notice about this finder is that it immediately adds a finder method to its super's list of finders. This is what enabled us to keep mixing in finder after finder at the beginning of this post.

The finder that it adds is its own findWhiteSpace method. As I explained, it takes a FunctionalCharArray and returns an Option[Lexeme]. In this case, it peeks at the top character in the array, and if that char is a space, tab or newline, it returns a Some with the Lexeme, and if it's not, it returns None. Simple.
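To see the finder in isolation, here's a minimal self-contained sketch. Lexeme, WhiteSpaceLex, and a stripped-down FunctionalCharArray are redeclared here just so the snippet runs on its own:

```scala
// Toy redeclarations so findWhiteSpace can run stand-alone.
sealed trait Lexeme
case object WhiteSpaceLex extends Lexeme

case class FunctionalCharArray(chars: Array[Char]) {
  def nextChar: Option[Char] = if (chars.size > 0) Some(chars(0)) else None
}

// The same logic as the trait's finder method: peek at the first char.
def findWhiteSpace(chars: FunctionalCharArray): Option[Lexeme] =
  chars.nextChar match {
    case Some(' ') | Some('\t') | Some('\n') => Some(WhiteSpaceLex)
    case _ => None
  }

assert(findWhiteSpace(FunctionalCharArray(" x".toCharArray)) == Some(WhiteSpaceLex))
assert(findWhiteSpace(FunctionalCharArray("x ".toCharArray)) == None)
```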

That one is pretty trivial though, it only needs to look at one character. Let's take a look at one that's more involved. Here is a version of CaseClassFinder. It's not the exact implementation, but I've only changed it slightly to help demonstrate.

class IdentifierOnlyLexer(var chars: FunctionalCharArray) extends
  Lexer with WhiteSpaceFinder with Identifinder

trait CaseClassFinder extends LexemeFinder {
  override def finders = super.finders ::: List(findCaseClass _)

  def findCaseClass(chars: FunctionalCharArray) = {
    if (chars.size < 10) None
    else {
      val lexer = new IdentifierOnlyLexer(chars)

      (lexer.nextToken, lexer.nextToken) match {
        case (Token(Identifier("case"), _),
              Token(Identifier("class"), _)) => Some(CaseClass)
        case _ => None
      }
    }
  }
}

I love this example because it's small, but not trivial, and because it reuses components in the lex package. Before I explain, let's think about what a CaseClassFinder should do. If it sees "case", followed by "class", at the beginning of the array, then it should return the Lexeme CaseClass. Otherwise, it should return None. But it's more important to think about that in a slightly different way:

If a CaseClassFinder finds the Identifier "case" followed by the Identifier "class", then it should return CaseClass.

Well, we already know how to find identifiers...it was the first thing I showed in this post! As it turns out, that's exactly how CaseClassFinder does it as well. It creates a new Lexer with its input, an IdentifierOnlyLexer (technically, it could fire up a Lexer with all the finders, but that would be overkill). It then asks the Lexer for its first two tokens, and if those Tokens contain the Lexemes Identifier("case") and Identifier("class"), then it knows it's found a case class. BAM!

The Identifinder and NumberFinder traits are too long to show here. I'll post a link to the codebase.

Now, we have a few things to do to wrap up. I still need to show how the Lexer itself actually uses the finders. And the astute reader might have already realized that two finders might both return a Lexeme. For example, given the array [':',':'], the Colon finder would return Colon, and the Identifinder would return Identifier(::). The Lexer handles this case easily: the real value must be the longer of the two. Simple as that. Now let's take a look at all of Lexer.
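The longest-match rule can be sketched in a couple of lines. Lexeme here is a toy stand-in for the real class, and maxBy is just a tidier way to express "sort by size, take the head":

```scala
// Toy Lexeme carrying the matched text.
case class Lexeme(data: String)

// Given every candidate the finders produced, keep the longest one.
def longest(candidates: List[Lexeme]): Lexeme =
  candidates.maxBy(_.data.size)

// Both the Colon finder and the Identifinder match "::"; the longer wins.
assert(longest(List(Lexeme(":"), Lexeme("::"))) == Lexeme("::"))
```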

Lexer Implementation (for real this time!)


 1 trait Lexer extends SyntaxErrorHander {
 2
 3   var chars: FunctionalCharArray
 4   var currentIndex = 0
 5
 6   def finders: List[FunctionalCharArray => Option[Lexeme]]
 7
 8   def nextToken: Token = {
 9
10     def skip(i: Int) {currentIndex += i; chars = chars.skip(i)}
11
12     val token = nextToken_including_whitespace_and_syntaxerror
13
14     token match {
15       case SyntaxError(i) => {
16         syntaxError(SyntaxError(i))
17         skip(1)
18         nextToken
19       }
20       case EOF(_) => token
21       case Token(WhiteSpaceLex, _) => skip(1); nextToken
22       case Token(lex, _) => skip(lex.data.length); token
23     }
24   }
25
26   private def nextToken_including_whitespace_and_syntaxerror = {
27     if (chars.nextChar == None) EOF(currentIndex)
28     else {
29       val lexemesFound: List[Lexeme] = {
30         finders.map(f => f(chars)).filter(_ match {
31           case Some(t) => true
32           case _ => false
33         }).map(_.get)
34       }
35
36       if (lexemesFound.size == 0) return SyntaxError(currentIndex)
37
38       val lexemesSorted =
39         lexemesFound.sort(_.data.size >= _.data.size)
40
41       Token(lexemesSorted(0), currentIndex)
42     }
43   }
44
45   def lexing(s: String): Lexer = {
46     lexing(new FunctionalCharArray(s.toCharArray))
47   }
48
49   def lexing(cs: FunctionalCharArray): Lexer = {
50     chars = cs; currentIndex = 0; this
51   }
52 }

First, look at nextToken_including_whitespace_and_syntaxerror, which starts on line 26. This method returns the best Token possible (like :: vs :), but more importantly, it always returns a Token. It returns WhiteSpace and SyntaxError tokens as well. The calling method, nextToken, is the guy in charge of filtering those out, resetting, and continuing. But we'll get to that in just a second. For now, let's enumerate the steps of nextToken_including_whitespace_and_syntaxerror.

  1. The first thing it does is check if there are any characters left. If there are none, it simply returns EOF. Any more calls to nextToken will continue to return EOF until the Lexer gets new data (via the lexing methods).
  2. It then calls all of its finders with the current array: finders.map(f => f(chars))
  3. It then immediately filters the results, because it's only interested in the finders that actually returned Some(Lexeme...), and not None. Of course.
  4. At that point (line 36) it checks to see that someone actually found a Lexeme (if (lexemesFound.size == 0) return SyntaxError(currentIndex)). If no finders found anything, then we must have some unrecognized character in the input. Syntax Error!
  5. On line 38 it sorts the Lexemes by size, aiming to get the largest.
  6. Finally it returns the largest token (ignoring any other matches).
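Steps 2 and 3 can be sketched with a couple of toy finders; flatMap over the Options is a tidier equivalent of the map/filter/get chain in the listing:

```scala
// Two toy finders over plain Strings, standing in for the real
// FunctionalCharArray-based ones: one finds numbers, one identifiers.
val finders: List[String => Option[String]] = List(
  s => if (s.nonEmpty && s.head.isDigit)  Some(s.takeWhile(_.isDigit))  else None,
  s => if (s.nonEmpty && s.head.isLetter) Some(s.takeWhile(_.isLetter)) else None
)

// Run every finder over the input and keep only the hits.
val lexemesFound = finders.flatMap(f => f("123 abc"))
assert(lexemesFound == List("123"))  // only the number finder matched

// When no finder matches, the list is empty: that's the SyntaxError case.
assert(finders.flatMap(f => f("+")).isEmpty)
```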

Now I'll explain the nextToken method, and then wrap up.

The first action nextToken takes is to call nextToken_including_whitespace_and_syntaxerror on line 12. With that token, it must make a decision.

  1. If it receives a SyntaxError, then notify the syntax error handler of it, and in an attempt to reset/restart, move ahead one space in the input and recur.
  2. If it receives an EOF, just hand it over to the Parser, it should know what to do with it.
  3. If it receives WhiteSpace, then also move ahead one space and recur. There might be a better strategy here (and/or for syntax errors), but this one is simple, and works ok.
  4. Finally, if it gets a legitimate token, then move the entire array ahead by the length of the Lexeme that was found, and return the token. The next time nextToken is called, it should start at the new location, one character beyond the end of this token.


Wrapping up: the way I've implemented this, I don't think it will be difficult at all to add in the missing features. XML will probably be a bear, just because. I don't have String literal support at all, so I'll add that next, and Char literals too. They should be mostly straightforward.
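As a teaser for that, here's a hedged sketch of what a string literal finder might look like. This is purely hypothetical (it's not in the codebase yet) and it ignores escape sequences entirely:

```scala
// Hypothetical string literal finder, over List[Char] for simplicity.
// Returns the literal's body if the input starts with a terminated
// double-quoted string, None otherwise.
def findStringLiteral(chars: List[Char]): Option[String] =
  chars match {
    case '"' :: rest =>
      val body = rest.takeWhile(_ != '"')
      if (rest.lift(body.length) == Some('"')) Some(body.mkString)
      else None // unterminated literal
    case _ => None
  }

assert(findStringLiteral("\"hi\" + x".toList) == Some("hi"))
assert(findStringLiteral("x".toList) == None)
assert(findStringLiteral("\"oops".toList) == None) // no closing quote
```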


If you're actually still reading, awesome! Thanks! I hope you've learned something. You can find all the actual source code here. Not bad for about 8 hours huh?

Oh BTW, my only two references for this were the Dragon Book (2 ed), and the Scala compiler itself.

Bye!

Monday, May 04, 2009

Typed Lambda Calculus For Me (And Maybe You)

I'm reading chapter 10 of Lambda-Calculus and Combinators by J. Roger Hindley and Jonathan P. Seldin, and I wanted to put down some notes. These notes are mostly for myself, much in the spirit of why I started this blog - to measure my own progress. However, since I know I have readers, I'll write it in a way that people can learn something. If people are interested in this, since I'm already investing time in it, I'd be happy to go into more detail on the beginner's stuff and/or some of the history behind it (I might just do it anyway). Also, if anyone smarter than me finds any errors here, or has tips to explain it better, please do. Okay Go.

In the book there is this nugget:

(λx:p->ø->t.
(λy:p->ø.
(λz:p.((x:p->ø->t z:p):ø->t (y:p->ø z:p):ø):t):p->t
):A
):B

You (the reader) are supposed to solve this for A and B.

For those of you unfamiliar with lambda calculus, and/or typed lambda calculus, this is a lot easier than it looks. What I'll do here is try to explain the steps nice and slow, and then at the bottom maybe try to write this in Scala to demonstrate.

Explicit Type Information


Ok, the important thing here is that we are given a lot of information. Here's a list of the things we definitely know about the types in the statement above:
  1. x:p->ø->t

  2. y:p->ø

  3. z:p

In more detail:
  1. x is a function that takes a p and returns a function that takes an ø and returns a t.

  2. y is a function that takes a p and returns an ø

  3. z is simply a p

Or in Scala:
  1. val x:p=>ø=>t

  2. val y:p=>ø

  3. val z:p

How do we know this? Well, that's basic typed lambda calculus. Here are a couple of very basic notes:
  1. The statement (λx:p->ø->t.M) is a function that takes an argument, x, which is of type p->ø->t. Arrows imply function types and hence x is a function that takes a p and returns a function that takes an ø and returns a t.

  2. The return value of the function in this tiny example is M (above is much more complicated, but that's what were getting to).

Function Application


Now, using what we know about the types, and the rest of the statement, what else can we figure out? Hmm, I guess we should talk about function application, by looking at the innermost bit of the statement above.

We see this: (x:p->ø->t z:p):ø->t, and this means we are calling the function x with the argument z. Remember that everything after a colon is simply a type. Ignoring the types, we could just write (x z), which is basic function application in lambda calculus. In Scala this would simply be x(z). And what type does calling x yield? Not to beat it into your brain, but remember that x is a function that takes a p and returns a function that takes an ø and returns a t. Calling it with a p will yield a function that takes an ø and returns a t, or more simply, ø->t.

In the original statement, we had (x:p->ø->t z:p):ø->t, and this is correct, as I just explained.

That isn't the only function application in the statement either. Right next to it we see (y:p->ø z:p):ø. Does this check out? Remember that y is a function that takes a p and returns an ø, and z is a p. Perfect. Applying y to the argument z yields an ø.

Now, the fact that I said "right next to it" is meaningful. Having resolved the first two applications, we end up with another application! The result of the first application ((x:p->ø->t z:p):ø->t) was an ø->t, and the result of the second was an ø, and these also match. Applying an ø->t to an ø results in a t. And now we've finished this big inner statement: ((x:p->ø->t z:p):ø->t (y:p->ø z:p):ø):t.
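The whole application chain can be checked in Scala, with Int, String, and Boolean as arbitrary concrete stand-ins for p, ø, and t (Phi replaces ø in the code only for readability):

```scala
// Concrete stand-ins for the abstract types p, ø, t.
type P   = Int
type Phi = String
type T   = Boolean

val x: P => Phi => T = n => s => s.length == n  // x : p -> ø -> t
val y: P => Phi      = n => "*" * n             // y : p -> ø
val z: P             = 3                        // z : p

val xz: Phi => T = x(z)    // (x z)        : ø -> t
val yz: Phi      = y(z)    // (y z)        : ø
val t:  T        = xz(yz)  // ((x z)(y z)) : t

assert(t)  // "***".length == 3
```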

Building Functions


Now that we've looked at the internal parts of the statement, boiling it all down to a t, we can start to work our way outwards. Recall that λ is used to denote a function. λx:Int."hey" is a function that takes an Int (named x) and returns a String, "hey". Fully typed, this would be (λx:Int."hey"):Int->String.

Working our way one step outwards from our function applications, we have (λz:p.((x:p->ø->t z:p):ø->t (y:p->ø z:p):ø):t):p->t. Recall that we already determined the inner part to be a t. We can now think of this as (λz:p.(some arbitrary t)):p->t. And that's correct. This is a function that takes an argument z of type p, and returns a t. Hence its type: p->t.

Solving for A


Let's take a look back to remember what we're trying to solve. One thing we need to find is A:

(λy:p->ø.
(λz:p.((x:p->ø->t z:p):ø->t (y:p->ø z:p):ø):t):p->t
):A

We've already determined the inner part to be correct - it's p->t. Using what we just learned about creating functions with λ, and what we learned about types earlier, we know that (λy:p->ø. whatever):A builds a function that takes an argument, y:p->ø. That is, an argument y of type function from p to ø.

Additionally, this new function returns an A. But remember that the type of a function is not just its return type; it is (input type -> output type). So A must be the function's input type (which we know to be the type of the argument y, or p->ø) -> the function's return type (which is the return type of the inner statement, which we solved earlier to be p->t).

A is (p->ø)->(p->t).

Solving for B


To solve for B, we do exactly what we did when solving for A.

Recall:
(λx:p->ø->t.
(λy:p->ø.
(λz:p.((x:p->ø->t z:p):ø->t (y:p->ø z:p):ø):t):p->t
):A
):B

Or:
(λx:p->ø->t.whatever:A):B

Clearly the first half of B (the input type) is (p->ø->t). The output type is the type of whatever, which (from looking just above) is A.

So B is (p->ø->t)->A

And fully expanded:

B is (p->ø->t)->(p->ø)->(p->t)

Formal Listing


This came directly from the book. I'm not sure if I can even legally put this all here. If I hear that I can't, I'll immediately take it down. All the information here is explained in detail above.


x:p->ø->t   z:p          y:p->ø   z:p
---------------          ------------
   (xz):ø->t               (yz):ø
-------------------------------------
            ((xz)(yz)):t
-------------------------------------
      (λz:p.((xz)(yz)):t):p->t
----------------------------------------------------
 (λy:p->ø.(λz:p.((xz)(yz)):t):p->t):(p->ø)->(p->t)
------------------------------------------------------------------------------------------
 (λx:p->ø->t.(λy:p->ø.(λz:p.((xz)(yz)):t):p->t):(p->ø)->(p->t)):(p->ø->t)->(p->ø)->(p->t)

The Scala Code To Prove It


trait Lambda {
  type p
  type ø
  type t
  type A = (p=>ø)=>(p=>t)
  type B = (p=>ø=>t)=>A

  val f: B = {(x: p => ø => t) => {(y: p => ø) => {(z: p) => x(z)(y(z))}}}
}
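To see the trait actually run, rather than merely compile, we can pin the abstract types to concrete stand-ins. Int, String, and Boolean here are arbitrary choices, and Phi replaces ø only for readability:

```scala
// The trait from above, with ASCII-friendly type member names.
trait Lambda {
  type P
  type Phi
  type T
  type A = (P => Phi) => (P => T)
  type B = (P => Phi => T) => A

  // The term the whole post derives: λx.λy.λz.(x z)(y z).
  val f: B = x => y => z => x(z)(y(z))
}

// Pin the abstract types to run f on real values.
object ConcreteLambda extends Lambda {
  type P   = Int
  type Phi = String
  type T   = Boolean
}

val g = ConcreteLambda.f(n => s => s.length == n)(n => "*" * n)
assert(g(3))  // ("***".length == 3) is true

val h = ConcreteLambda.f(n => s => s.length == n)(n => "hi")
assert(!h(3)) // ("hi".length == 3) is false
```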