parboiled for Scala

Posted by Mathias in [parboiled]

16 Aug 2010

The recently released parboiled version 0.9.8 comes with a brand-new feature that will hopefully make parboiled an attractive tool for an even larger developer community: a Scala facade.
Before v0.9.8 parboiled was a pure-Java library written mainly for Java developers. It basically consisted of two things, an efficient recursive-descent PEG parsing engine and an internal Java DSL for parser rule building. Starting with version 0.9.8 there is now a third component: an internal Scala DSL enabling Scala developers to define the rules of their PEG parsers in a way that on the surface looks very similar to Scala’s own parser combinators.

In fact parboiled for Scala shares more with Scala’s parser combinators than just looks. Both frameworks provide a solution to the same problem: creating parsers for arbitrary input directly in Scala, with minimal effort and a flat learning-curve. And both frameworks are similarly expressive with the regard to the formal language classes they describe. However, under the surface the differences are quite significant.
The quite elegant and surprisingly compact implementation of Scalas parser combinators comes with a few unfortunate drawbacks that can limit its applicability to real-world applications:

  1. Poor performance.
    This might be the biggest issue for many users. The reason behind Scalas combinator parser inefficiency is not the theoretically exponential runtime due to excessive backtracking (which has been fixed anyway in Scala 2.8 with the introduction of “packrat” support). It is the downside of the very functional programming model that doesn’t separate parser construction from parser execution (i.e. input analysis). During every parsing run any non-trivial Scala combinator parser creates and recreates many, many sub parsers along with a great number of other supporting JVM objects effectively bogging down the JVM with object creation and garbage collection.

  2. Limited error reporting
    As Martin Odersky writes in “Programming in Scala”: «Error reporting for parsers is somewhat of a black art.». I.e., it is difficult to get right. Whereby it is unclear when a specific solution is “right”, because there is no objectively “correct” error message. It all depends on the users needs. However, as a long-term developer I know that error messages can rarely be too informative.
    The default behavior of the combinator library is to report the first parse error found and issue a rather misleading “expected <something>” message. ”<something>” being one of the potentially many different things that the specific language expects at the given error location. For applications displaying parse error messages to end users this is probably not good enough, which is why the combinator library offers different ways of tweaking error message creation. However, all of these require manual work and can be difficult and time-consuming for larger, more complex grammars, something that conflicts with the original goal of being easy-to-use and quick-to-learn.

  3. No error recovery
    Along the lines of the second point: a parsers ability to continue parsing beyond the first error location makes an application much more real-word ready. Imagine a syntax highlighter that only works up to first syntax error.
    The combinator library currently doesn’t do any parse error recovery.

  4. Non-standard grammar rules
    When starting the development of a grammar for a custom DSL it is always a good idea to not reinvent the wheel. Turning to existing grammars for ideas and common solutions, sometimes even using one as a starting template, can be a huge time-saver. Unfortunately the rule primitives of Scalas parser combinators do not easily map onto existing “standards”, so that it is more difficult to transcribe, for example, existing ANTLR LL(*) grammars or grammars in “standard” PEG notation.

Being built on radically different foundations parboiled beats Scalas parser combinators in each of these four areas without significant differences in ease of use or grammar compactness.

This is what you get when building a PEG parser with parboiled for Scala:

  • An internal Scala DSL for efficiently defining your parser rules
    Very similar to the combinator notation. Some differences in the way parser actions are specified.
  • Good parsing performance
    The existing Java 1.6 parser (currently only available in the parboiled for Java DSL) parses about 50’000 lines of Java code (ca. 1.8 Mio. characters) per second on one core of an 2.4 GHz Intel i5 (under OS/X Java 6).
  • Good error reporting
    parboiled contains some rather sophisticated logic for building meaningful error messages. Effectively tells you all possible “expectees” at the error location.
  • Proper error recovery
    parboiled can recover from parse errors and continue parsing to the very end of the input to report all syntax errors. Intelligently finds the best recovery for single character errors.
  • Rule primitives mapping directly to “standard” PEG notation
    Even though PEGs are not as widespread as for example LL(*) grammars parboiled users can directly benefit from a growing range of existing grammars to jump-start their own parsing projects.

parboiled separates parser construction from rule execution. In a first step your grammar DSL creates a structure of parser rule objects that exists independently from the actual parsing input. The parsing engine can then apply that rule structure to any given input very efficiently. Apart from a significant performance benefit this basic principle has a number of other advantages that I might go into more detail for in another post.

So, what does a parboiled for Scala parser look like?
Here is the “standard” calculator example complete with parser actions:

class SimpleCalculator extends Parser {

  def Expression: Rule1[Int] = rule {
    Term ~ zeroOrMore(
      "+" ~ Term ~~> ((a:Int, b) => a + b)
    | "-" ~ Term ~~> ((a:Int, b) => a - b)
    )
  }

  def Term = rule {
    Factor ~ zeroOrMore(
      "*" ~ Factor ~~> ((a:Int, b) => a * b)
    | "/" ~ Factor ~~> ((a:Int, b) => a / b)
    )
  }

  def Factor = rule { Number | Parens }

  def Parens = rule { "(" ~ Expression ~ ")" }

  def Number = rule { oneOrMore("0" - "9") ~> (_.toInt) }
}

You can find other examples (like a complete JSON parser, a Java 6 parser and a complete Markdown processor (the latter two written with parboiled for Java)) as well as the full documentation on the parboiled website.

parboiled is actively being developed, it’s core and the Java DSL are already being used by numerous projects for all kinds of parsing applications. The Scala facade is still new and therefore somewhat experimental. There are certainly quite a few things that can be added and improved. All feedback, especially (but not only) from people with more Scala experience than myself, is very welcome. There is also a brand-new mailing list / forum that you can use.

I hope you consider parboiled a worthy addition to the Scala ecosystem.

Cheers,
Mathias

View Comments