Imagine a bacon-wrapped Ferrari. Still not better than our free technical reports.
See all our reports

Parsing Lambda Calculus in Scala

Introduction

In a previous post we introduced the pure untyped λ-calculus. Summed up in one sentence: it’s the smallest universal programming language with only a few language constructs (anonymous functions, variables and function applications), and is useful in understanding computing at a hardware-independent level.

In this post we will implement the beginnings of a λ-calculus interpreter consisting of an AST, a parser and a basic read-print-loop to exercise the parser. Adding evaluation to get a functioning REPL will be left for later. We’ll do the implementation in Scala, and demonstrate some Scala features/libraries in the process. You could of course follow along in your favourite language (hopefully it has a parser combinator library). The full source code is available on GitHub , with commits roughly corresponding to code snippets in this post.

Somewhat appropriately, tomorrow (June 14th) would be the 110th birthday of Alonzo Church, who invented λ-calculus.

The λ-calculus is a good tool to study compilers/interpreters because it’s such a simple language. We’re not writing a full compiler of course, and not going deep into theory either, but we will implement some parts that are usually found in a compiler:

  • Parser – turns source code into a model understandable to the rest of the interpreter
  • AST (Abstract Syntax Tree) – the model for the code, with syntax details removed
  • Semantic actions - construction of the model during parsing
  • Pretty printing – turning the model back into syntax

If you’ve never written a parser, I would recommend reading another parser combinator tutorial, because we skip over many of the details and deeper explanations of concepts. Scala features are also glanced over and not covered in depth. We recently published a report on JVM languages that included some basic Scala examples (pages 12-14 in the PDF), so you might want to check it out if you don’t know Scala that well. Or check out some other Scala tutorials on the web.

Abstract Syntax Tree

We start with defining the AST, because it’s really small and simple. An Abstract Syntax Tree is a representation of programs where the exact syntax has been abstracted out. When a parser produces an AST, it removes the syntax, but keeps the semantics of the source program.

To recap, λ-calculus expressions are made of three elements:

  • lambda-abstractions (anonymous functions)
  • variables
  • function applications

In Scala, it seems most logical to use a sealed hierarchy of case classes to represent this. A sealed hierarchy helps the compiler warn us if we have missing cases in pattern matches — encountering a missing case with no catch-all would throw a MatchError.

Pattern matching with dumb data classes seem best here, as we will have several operations with the AST, but the cases are always the same — we want closed data types, but open functions (see the “expression problem”).

Using shorter names, this is one way to implement the AST:

sealed trait Expr
case class Lambda(arg: Var, body: Expr) extends Expr
case class Var(name: String) extends Expr
case class Apply(fun: Expr, arg: Expr) extends Expr

For a Lambda, we are interested in the name of the argument and the body of the function. For a Var, we just need to know the name, and for Apply, we can have any Expr on both left and right sides. With this AST model, we can construct some lambda expressions:

val id = Lambda(Var("x"), Var("x"))
val one = Lambda(Var("s"), Lambda(Var("z"), Apply(Var("s"), Var("z"))))

and we can pattern match on them

expr match {
  case Lambda(arg, body) => ...
  case Var(name)         => ...
  case Apply(fun, arg)   => ...
}

Pretty Printing

Let’s put this to use and get pretty printing out of the way — it takes the AST model and adds the syntax back, producing valid source code as output again. It’s easier to turn the AST back into syntax than to turn syntax into AST, so we’ll do the printing before parsing. Of course, pretty printing cannot necessarily reproduce the exact original syntax, because that information is usually removed by the parser. So instead it has to guesstimate how someone might have written the program that is now represented as AST.

To implement the pretty printer, we can make use of custom String interpolation (new in Scala 2.10) to make the printing code look similar to the real syntax.

class PrettyPrinter {
  def apply(expr: Expr): String = expr match {
    case Lambda(arg, body) => p"λ$arg.$body"
    case Apply(fun, arg)   => p"$fun $arg"
    case Var(name)         => s"$name"
  }
 
  implicit class PrettyPrinting(val sc: StringContext) {
    def p(args: Expr*) = sc.s((args map parensIfNeeded):_*)
  }
 
  def parensIfNeeded(expr: Expr) = expr match {
    case Var(name) => name
    case _         => "(" + apply(expr) + ")"
  }
 
}

We call the main printing function apply, a magic name that lets us use a PrettyPrinter instance as a function (somewhat convenient). apply does a pattern match over all types of Expr and produces a String for each. The implicit conversion from StringContext to PrettyPrinting allows us to use the p method from PrettyPrinting as a prefix to String literals. Each $-expression in the String will be passed to p, which recursively pretty-prints them and adds parentheses if needed (this adds more parentheses than we would by hand, but I leave improving this as an exercise to you). Here’s a simple test to see that it works:

val one = Lambda(Var("s"), Lambda(Var("z"), Apply(Var("s"), Var("z"))))
val pretty = new PrettyPrinter()
println(pretty(one))

Output:

λs.(λz.(s z))

Now that we have a way of turning AST into source code, let’s get to turning source code into AST, because constructing programs in AST form is rather ugly and unreadable. We have to write a parser and Scala has a nice library for this: parser combinators (scala.util.parsing).

Parser Combinators

Parser combinators (also called combinator parsers) are probably the easiest way to learn how to implement a parser, because of the simple ways in which you can compose larger parsers from smaller ones. Parser combinators are a huge topic on their own, but here we’ll just introduce as much as we need to implement our λ-calculus parser.

Conceptually, a parser is a function from Input to Output: Input -> Output. The input is usually plain text as a character stream. The Output type can be anything (say T), but parsing can succeed or fail, so the type is more like:

Input -> (T | Failure) where | is disjoint union of two types

To combine parsers, we need the remaining input after one parser has parsed some part of it:

Input -> (T | Failure) & Input where & is product of two types (pair)

The Input in the return type is for “rest of the input”. For example if “x” was parsed from “x y”, the rest is “ y”. For the specific parser types in Scala, take a look at the Parsers class.

Source code for a programming language is usually parsed in two stages:

  1. a lexical analyser (lexer) takes a character stream and produces a token stream
  2. a syntactic analyser (parser) takes the token stream and produces an AST

The Scala parser combinator library already has a lexer we can use to produce tokens such as identifiers and keywords, separated by whitespace (which is ignored in the produced token stream). To use it, we extend StdTokenParsers, implement its abstract members, and tell the lexer which characters are delimiters.

import scala.util.parsing.combinator.syntactical.StdTokenParsers
import scala.util.parsing.combinator.lexical.StdLexical
 
class LambdaParser extends StdTokenParsers {
  type Tokens = StdLexical
  val lexical = new StdLexical
  lexical.delimiters ++= Seq("λ", ".", "(", ")")
}

The simplest kind of parse function could just parse a variable name from a String.

def parse(source: String): ParseResult[String] = {
  val tokens = new lexical.Scanner(source)
  phrase(ident)(tokens)
}

In this method, we give the source to the lexical token scanner, which produces input for the syntactic parsers. ident is a built-in parser that parses a single identifier (variable name). phrase takes an existing parser and says: “this parser succeeds only if it consumes all of the input”. Now we can do this

new LambdaParser().parse("hello")

and get back “hello”. The intermediate token stream would just have one token: Identifier("hello"). The parse result could be either Success, or NoSuccess and we do a pattern match to find out, printing the result or the error.

def main(args: Array[String]) {
  val parser = new LambdaParser()
  import parser.{Success, NoSuccess}
  parser.parse("hello world") match {
    case Success(expr, _) => println("Parsed: " + expr)
    case err: NoSuccess   => println(err)
  }
}

In case of an error, we’ll get a message which shows us where exactly we had a problem in the program. For example, parsing “hello world” gives us:

[1.7] failure: end of input expected
hello world
      ^

Our parser wasn’t expecting anything after the identifier “hello” because we used the phrase parser which requires consumption of all input; the token stream in this case would have two identifiers: Identifier("hello"), Identifier("world").

Grammar for λ-calculus

Time to write the real parser! First, we should know what the grammar for λ-calculus is. Without going deeply into grammars, here is what it looks like described in EBNF:

expr        = lambda | application | variable | parens;
lambda      = "λ", variable, ".", expr;
application = expr, expr;
variable    = identifier;
parens      = "(", expr, ")";

This grammar describes how the syntactic elements can be combined to form programs in the language. Each line is a production rule saying how symbols can be combined in a particular way to produce another valid symbol. In this case we only need two combinators: comma means “sequence” and | means “or”. So the line

lambda = "λ", variable, ".", expr;

can be read as: a lambda is the character "λ" followed by a variable, followed by the character ".", followed by an expression. You should notice that this definition is recursive, and the expression at the end could be a lambda again because: expression is a lambda, or an application, or a variable, or parentheses.

Turns out we can pretty much mechanically translate this grammar into Scala code. But first, lets see how the Scala parsers can be combined. These are the combinators we need:

  • "x" (String literal) parse the keyword/delimiter "x"
  • a ~ b parse a, and then parse b (sequence)
  • a | b parse a, or parse b (or)
  • a ^^ f parse a, apply function f to result (semantic action)

There are more combinators, but we don’t need more for parsing λ-calculus. We can copy & paste the EBNF grammar description into the parser, prefix lines with def and replace , with ~:

def expr        = lambda | application | variable | parens
def lambda      = "λ" ~ variable ~ "." ~ expr
def application = expr ~ expr
def variable    = ident
def parens      = "(" ~ expr ~ ")"

Now we have functions for each of the production rules, but this doesn’t compile yet: “recursive method expr needs result type”. Types for recursive functions cannot be inferred and we need to provide them. The type of variable is Parser[String] while the types of the other parsers are Parser[~[...]] (~ is also the type for the result of sequencing, representing a pair of results). The type of expr would be Parser[Any] as Any is the common supertype for String and ~. But it would be difficult to deal with a mess of String and ~ objects after we’ve parsed them.

Instead we can turn syntax into AST immediately during parsing, and give expr the type Parser[Expr]. We can add semantic actions (creation of AST nodes) to each parser with the ^^ combinator and an anonymous function that turns a successful parse result into a specific AST node. For example, application and variable could be done like this:

def application = expr ~ expr ^^ { (r: ~[Expr, Expr]) => Apply(r._1, r._2) }
def variable    = ident       ^^ { (r: String)        => Var(r) }

Variable is easy, the type of result for ident is String and we just pass it to Var. For Apply, the ~ result type is a pair of the results of the sequence. We can use a pattern match directly as an anonymous partial function and let the types be inferred to make the code more readable.

def application = expr ~ expr      ^^ { case left ~ right  => Apply(left, right) }
def variable    = ident            ^^ { case name          => Var(name) }
def parens      = "(" ~ expr ~ ")" ^^ { case "(" ~ e ~ ")" => e }

Adding an action for parens as well, we notice a repetition of the parentheses characters. We can ignore the parentheses in the result by using combinators ~> and <~ instead of ~. ~> means “sequence, forgetting the left hand side after parsing it“ and <~ is the same, but mirrored.

def parens = "(" ~> expr <~ ")" ^^ { case e => e }

With that, the action for parens became the identity function { case e => e }, and we can drop it completely.

Making use of the fact that case class constructors are functions, the Var case becomes even simpler. Adding the lambda action as well, the parser code should now compile:

def expr: Parser[Expr] = lambda | application | variable | parens
def lambda             = "λ" ~> variable ~ "." ~ expr ^^ { case v ~ "." ~ e  => Lambda(v, e) }
def application        = expr ~ expr                  ^^ { case left ~ right => Apply(left, right) }
def variable           = ident                        ^^ Var
def parens             = "(" ~> expr <~ ")"

But does it run? Lets change the parse method from using ident to expr (and the return type accordingly) and see what happens when parsing the identity function λx.x:

java.lang.StackOverflowError

Oops! WTF happened? What happened is left-recursion happened! Specifically in

def expr = ... | application | ...
def application = expr ~ ...

When application is parsed it tries to parse expr, which tries to parse application, which tries to parse expr and this continues infinitely (or until the stack overflows). When you have left-recursion with parser combinators, you might have to change the grammar to get rid of it. But thankfully, there is a way to make left-recursion work: “packrat parsers”.

Packrat Parsers

Let’s not go into details of packrat parsers here — suffice it to say that they are useful for at least two things: speeding up parser combinators (by memoizing results) and allowing left-recursion. To introduce them into LambdaParser, mix in the PackratParsers trait, turn each Parser[T] into a PackratParser[T], and use lazy vals instead of defs. Conversion to PackratParser is done implicitly, but we have to indicate that we want it to happen, by adding type annotations. For brevity, we introduce a type alias P for PackratParser.

type P[+T] = PackratParser[T]
lazy val expr: P[Expr]         = lambda | application | variable | parens
lazy val lambda: P[Lambda]     = "λ" ~> variable ~ "." ~ expr ^^
                                 { case v ~ "." ~ e  => Lambda(v, e) }
lazy val application: P[Apply] = expr ~ expr ^^
                                 { case left ~ right => Apply(left, right) }
lazy val variable: P[Var]      = ident ^^ Var
lazy val parens: P[Expr]       = "(" ~> expr <~ ")"

Looks a bit more noisier now, but we can live with it. The transformation from grammar has still been mostly mechanical. Let’s try to parse λx.x again:

[1.3] failure: ``('' expected but `.' found
λx.x
  ^

Hmm… what’s the problem now? Do we have something wobbly in our grammar? Turns out that our lexer handles λx as a single identifier because both characters are valid identifier characters. StdLexical, while convenient, didn’t quite cut it for our syntax. We can still make it work by modifying it just a bit. Lets create a subclass named LambdaLexer. Looking at StdLexical, the token parser — which produces tokens for the token stream — first tries to parse identChar’s into an Identifier. 'λ' is an identChar because it’s a letter. So lets make it not so by claiming that 'λ' is not a letter.

class LambdaLexer extends StdLexical {
  override def letter = elem("letter", c => c.isLetter && c != 'λ')
}

Replacing our use of StdLexical with LambdaLexer, we can now parse the identity function λx.x:

Parsed: Lambda(Var(x),Var(x))

Seems like we have a working parser! Go ahead and try some more expressions from the previous post as well.

REPL – Read Eval Print Loop

What next? It would be annoying to have to change the main program and launch it every time we want to parse something different, so lets implement a REPL (Read-Eval-Print-Loop). We’re going to save the “Eval” part for a later post and just print out what we parsed for each line of user input.

object LambdaREPL {
  val parser = new LambdaParser
  def main(args: Array[String]) = loop()
 
  def loop() {
    while (true) {
      val exprSrc = readLine("λ> ")
      import parser.{Success, NoSuccess}
      parser.parse(exprSrc) match {
        case Success(expr, _) => println("Parsed: " + expr)
        case err: NoSuccess   => println(err)
      }
    }
  }
}

Yay! We have some kind of loop going, even if it just prints the data structure that it parsed. The implementation is very simple: in an infinite loop, ask for user input, parse it and print the result.

λ> λx.x
Parsed: Lambda(Var(x),Var(x))
λ> λs.λz.s z
Parsed: Lambda(Var(s),Lambda(Var(z),Apply(Var(s),Var(z))))

NOTE: If you use an IDE and the IDE process has a different default character encoding than UTF-8, you might have a problem inputting the lambda character in your IDE’s console. (I added -Dfile.encoding=UTF-8 to eclipse.ini to make it work)

Typing the λ character is not great on most keyboards, so we’ll add an alias for it: \ as in Haskell. We add "\\" to delimiters and replace "λ" in the lambda parser with ("λ" | "\\").

λ> \x.x
Parsed: Lambda(Var(x),Var(x))

Ok, it was good to see some examples of the parsed ASTs being printed out, but it would look better if plugged in the PrettyPrinter into the REPL:

object LambdaREPL {
  val pretty = new PrettyPrinter
  ...
      case Success(expr, _) => println("Parsed: " + pretty(expr))
  ...
}

Let’s run it again and try out some more expressions:

λ> \t.\f.t
Parsed: λt.(λf.t)
λ> \s.\z.z
Parsed: λs.(λz.z)
λ> \s.\z.s (s z)
Parsed: λs.(λz.(s (s z)))

Conclusion

This concludes our post on parsing the lambda calculus in Scala. We had a quick look at what some parts of a lambda calculus interpreter could look like, what an AST is, how to pretty print the AST to get the source code for it, how to parse source code and turn it into an AST.

We introduced parser combinators and implemented a parser for λ-calculus in only about 15 lines of meaningful code! We also created the basic framework for a REPL, which is still missing one piece.

Tune in next time, when we complete the REPL with an evaluator along with more talk about variable substitution and evaluation strategies for the λ-calculus.

  • Roman

    Thank you for that nice article! It really helped me to dive into Scala. It was actually the first tutorial I did in Scala.

  • Danilo Domínguez

    Thanks for the article. I wrote a parser for a small language in Haskell parser monads and I was trying to mimic the same behavior in Scala.