I've been working on my new Rust side-project for several months now, and I have some lessons learned to share. The project is called pq - a command-line tool to parse and query log files as time series. It comes with its own domain-specific language, heavily influenced by PromQL. Typical pq usage may look like this:

tail -f /var/log/nginx/access.log | pq '
/...some fancy regex.../
| map {
    .0 as ip,
    .1:ts,
    .2 as method,
    .3:str as status_code,
    .4 as content_len
  }
| select topk(
      10,
      sum(
          count_over_time(
              __line__{method="GET", status_code="200"}[1s]
          )
      ) by (ip)
  )
| to_json
'

pq has many components, including various log parsing strategies and a pretty sophisticated query execution engine. But surprisingly or not, about half of the time I've put into this project so far has gone into writing the parser for pq's own query language. To be honest, when I started the project, I didn't see that coming...


Luckily, writing a parser in Rust was a mostly pleasant experience, thanks to a crate concisely named nom - although learning how to write parsers with nom wasn't entirely seamless either. So here is my journey.

Parser combinators

As I was taught at university, writing a parser starts with defining a formal grammar 🥱

The grammar describes a language in a rather declarative way. Yay, everybody loves functional programming! The grammar is then fed to a parser generator such as Yacc or Bison, which produces the code of a new, fully-generated parser. For instance, here is the real PromQL grammar (looks fine) and the corresponding generated parser (looks terrifying).

Well, for some reason, I've never felt excited about this approach...

On the other end of the spectrum resides the entirely manual way of writing parsers. First, the input text needs to be tokenized into keywords, identifiers, number and string literals, etc. A particular sequence of tokens can then be recognized as a certain expression, such as a function call or a binary operation. While this technique appeals to me more, it would require writing a substantial number of low-level tokenization routines. The parser logic itself would also need quite some boilerplate code for matching various combinations of tokens into higher-level constructs - syntax tree nodes.

Parser combinators to the rescue!

Apparently, there are libraries that free you from reimplementing this low-level repetitive logic over and over again. They provide primitive functions like take_next_5_bytes(), recognize_a_float(), or recognize_a_word() for tokenization, and functions like pair(f1, f2) or alt(opt1, opt2, opt3, ...) for recursively combining other parsing routines.
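
To make this concrete, here is a tiny hand-rolled sketch of what such primitives and combinators could look like under the hood - no library involved, all names invented for illustration:

```rust
/// A parser returns Some((rest, matched)) on success, None on failure.
type PResult<'a> = Option<(&'a str, &'a str)>;

/// Primitive: recognize a fixed word at the start of the input
/// (nom calls this `tag`).
fn word<'a>(w: &str, input: &'a str) -> PResult<'a> {
    input.strip_prefix(w).map(|rest| (rest, &input[..w.len()]))
}

/// Combinator: try the alternatives in order, first success wins
/// (nom calls this `alt`).
fn alt<'a>(ws: &[&str], input: &'a str) -> PResult<'a> {
    ws.iter().find_map(|&w| word(w, input))
}

/// Combinator: run two parsers in sequence (nom calls this `pair`).
fn pair<'a>(w1: &str, w2: &str, input: &'a str) -> PResult<'a> {
    let (rest, _) = word(w1, input)?;
    let (rest, _) = word(w2, rest)?;
    Some((rest, &input[..input.len() - rest.len()]))
}

fn main() {
    assert_eq!(word("foo", "foo bar"), Some((" bar", "foo")));
    assert_eq!(alt(&["foo", "bar"], "bar!"), Some(("!", "bar")));
    assert_eq!(pair("foo", "bar", "foobar rest"), Some((" rest", "foobar")));
}
```

Real libraries generalize this over input types, value types, and error types, but the core contract stays the same: consume a prefix, return the rest plus a value.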

A parser developer can pack one or more such parsers and/or combinators into a domain-specific function with a meaningful name. A combination of such higher-level functions forms the language grammar - but now the grammar is expressed directly in the target code! In the case of Rust, nom is one such parser combinator framework. From the very real pq code:

use nom::{
    branch::alt,
    bytes::complete::{tag, tag_no_case},
    character::complete::{alpha1, alphanumeric1},
    combinator::recognize,
    multi::many0,
    sequence::pair,
};

// `Span` and `IResult` below are pq's own aliases over nom's types -
// more on them in the error handling sections.

/// Parses a label matching clause into a syntax tree node `LabelMatching`:
///  - on()
///  - on(label1, label2, label3, ...)
///  - ignoring()
///  - ignoring(label1, label2, label3, ...)
fn label_matching(input: Span) -> IResult<LabelMatching> {
    let (rest, kind) = alt((tag_no_case("on"), tag_no_case("ignoring")))(input)?;
    let (rest, labels) = grouping_labels(rest)?;
    Ok((rest, LabelMatching::new(kind, labels)))
}

fn grouping_labels(input: Span) -> IResult<Vec<String>> {
    // Note: separated_list() is pq's own helper here, not nom's.
    separated_list(
        '(',          // opener
        ')',          // closer
        ',',          // delimiter
        label_identifier,
    )(input)
}

/// Parses strings like [a-zA-Z_][a-zA-Z0-9_]*
fn label_identifier(input: Span) -> IResult<String> {
    // The true power of combination!
    let (rest, m) = recognize(pair(
        alt((alpha1, tag("_"))),
        many0(alt((alphanumeric1, tag("_")))),
    ))(input)?;
    Ok((rest, String::from(*m.fragment())))
}

As you can see, the resulting code looks pretty declarative. However, nom doesn't stop you from writing imperative logic when that simplifies the implementation. For instance, I couldn't find a better way to specify a regular expression in pq programs than the /.../ syntax. But some regular expressions may contain slashes, so I had to support escaping. And then I cut a corner with the following hack:

use nom::{bytes::complete::take, character::complete::char};

// `Span`, `IResult`, and `ParseError` are pq's own types (see below).

/// Parses a /<regex>/ into a string:
///  - /\d+/ becomes "\d+"
///  - /\w+\/some\/path\s/ becomes "\w+/some/path\s"
fn decoder_regex(input: Span) -> IResult<String> {
    let (rest, _) = char('/')(input)?;

    // Some imperative ad-hoc parsing logic...
    if let Some(end_pos) = find_first_unescaped_slash(*rest) {
        let (rest, regex) = take(end_pos)(rest)?;
        let (rest, _) = char('/')(rest)?;
        return Ok((rest, (*regex).replace(r#"\/"#, "/")));
    }

    Err(nom::Err::Failure(ParseError::partial(
        "regex",
        "closing '/' symbol",
        rest,
    )))
}

Nom parser example

Ok, I hope this is enough to sell you on the beauty of the parser combinator approach. Now let's try to write a tiny parser using nom and see what problems we may stumble upon along the way.

The high-level idea of combinators is probably clear by now, but it's always good to understand the lower-level details before jumping into the actual parser implementation. Oversimplifying a bit, a typical nom routine can be described as follows:

  • it takes in a string that needs to be parsed
  • it attempts to match the beginning of the string with some pattern
  • if a match is found, it returns Ok((remainder, parsed_value))
  • if no match is found, it returns an Err(), so that alternate parsers, if any, can have a try.

Example with a simple word match. The parsed value here is a substring:

use nom::{bytes::complete::tag, IResult};

fn foo(s: &str) -> IResult<&str, &str> {
    tag("foo")(s)
}

fn main() {
    assert_eq!(foo("foo bar"), Ok((" bar", "foo")));
    assert!(foo("1234567").is_err());
}

Building on the idea - adding a combinator separated_pair. The parsed value becomes a tuple of substrings:

use nom::{
    bytes::complete::tag,
    character::complete::char,
    sequence::separated_pair,
    IResult
};

fn foo(s: &str) -> IResult<&str, &str> {
    tag("foo")(s)
}

fn bar(s: &str) -> IResult<&str, &str> {
    tag("bar")(s)
}

fn foo_bar(s: &str) -> IResult<&str, (&str, &str)> {
    separated_pair(foo, char(' '), bar)(s)
}

fn main() {
    assert_eq!(foo_bar("foo bar"), Ok(("", ("foo", "bar"))));
    assert!(foo_bar("1234567").is_err());
}

Adding even more combinators:

use nom::{
    branch::alt,
    bytes::complete::tag,
    character::complete::{char, digit1},
    combinator::recognize,
    sequence::separated_pair,
    IResult,
};

fn foo(s: &str) -> IResult<&str, &str> {
    tag("foo")(s)
}

fn bar(s: &str) -> IResult<&str, &str> {
    tag("bar")(s)
}

fn foo_bar(s: &str) -> IResult<&str, (&str, &str)> {
    separated_pair(foo, char(' '), bar)(s)
}

fn foo_bar_or_number(s: &str) -> IResult<&str, &str> {
    alt((recognize(foo_bar), recognize(digit1)))(s)
}

fn main() {
    assert_eq!(foo_bar_or_number("foo bar"), Ok(("", "foo bar")));
    assert_eq!(foo_bar_or_number("1234567"), Ok(("", "1234567")));
}

The foo_bar_or_number() parser alternates between the foo_bar() and digit1 subparsers. But since the first one returns a tuple ("foo", "bar") while the second one returns just a single substring "1234567", the results need to be coalesced into a common type. The recognize() utility does the trick here - it discards the actual parsed value and returns the matched substring instead.

Well, that's probably it for the basics!

As usual, there is more than one way to skin a cat - nom comes with lots of small parsers and combinators, and they can be combined in many ways to express the same parsing logic. Personally, I find it pretty rewarding to think through the various ways certain parsers can be put together to handle a given grammar. But I couldn't imagine doing it without this super-helpful page listing the most frequently used routines by group.

Pitfall - nom error handling

Ok, enough of the made-up stuff. Let's mess with a real-world problem. Much like pq does, I'll try to parse a simplified subset of PromQL here. Or, more precisely, a vector selector expression that looks like this:

http_requests_total{method="GET", status_code=~"2.."}

First, the code to parse an identifier. It will be used to match metric and label names in the vector selector:

/// Parses an identifier (in the most common sense).
///
/// ```
/// # use nom_parser_example::identifier;
/// #
/// # fn main() {
/// assert_eq!(identifier("foo bar"), Ok((" bar", "foo")));
/// assert_eq!(identifier("_12 bar"), Ok((" bar", "_12")));
/// assert_eq!(identifier("012 bar").is_err(), true);
/// # }
/// ```
pub fn identifier(input: &str) -> IResult<&str, &str> {
    // [a-zA-Z_][a-zA-Z0-9_]*
    let (rest, m) = recognize(pair(
        alt((alpha1, tag("_"))),
        many0(alt((alphanumeric1, tag("_")))),
    ))(input)?;
    Ok((rest, m))
}

Now, naive string literal parsing. Don't use this code in production - it doesn't handle escape sequences! (check out this nom example for more)

/// Parses a quoted string.
///
/// ```
/// # use nom_parser_example::naive::string_literal;
/// #
/// # fn main() {
/// assert_eq!(string_literal(r#""" baz"#), Ok((" baz", "")));
/// assert_eq!(string_literal(r#""foo bar" baz"#), Ok((" baz", "foo bar")));
/// assert_eq!(string_literal("qux").is_err(), true);
/// # }
/// ```
pub fn string_literal(input: &str) -> IResult<&str, &str> {
    let (rest, m) = recognize(
        delimited(char('"'), many0(is_not("\"")), char('"'))
    )(input)?;
    Ok((rest, &m[1..m.len() - 1]))
}
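
For the record, handling escapes doesn't take much more code. Here is a hand-rolled, library-free sketch of the idea (the linked nom example does it properly with nom's own combinators; this version also simplifies escape semantics by keeping the escaped character as-is):

```rust
/// Parses a double-quoted string, honoring \" and \\ escapes.
/// Returns (rest, unescaped_contents), or None on a malformed literal.
fn escaped_string_literal(input: &str) -> Option<(&str, String)> {
    let mut chars = input.char_indices();
    if chars.next()? != (0, '"') {
        return None; // must start with an opening quote
    }
    let mut out = String::new();
    while let Some((i, c)) = chars.next() {
        match c {
            '"' => return Some((&input[i + 1..], out)),
            // Keep the escaped character verbatim (so \" -> ", \\ -> \).
            '\\' => out.push(chars.next()?.1),
            _ => out.push(c),
        }
    }
    None // no closing quote found
}

fn main() {
    assert_eq!(
        escaped_string_literal(r#""say \"hi\"" rest"#),
        Some((" rest", r#"say "hi""#.to_string()))
    );
    assert!(escaped_string_literal(r#""unterminated"#).is_none());
}
```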

Parsing a label matching operator with a handy value mapper:

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum MatchOp {
    Eql,   // ==
    Neq,   // !=
    EqlRe, // =~
    NeqRe, // !~
}

/// Parses a label matching operator.
///
/// ```
/// # use nom_parser_example::naive::{match_op, MatchOp};
/// #
/// # fn main() {
/// assert_eq!(match_op("=="), Ok(("", MatchOp::Eql)));
/// assert_eq!(match_op("!="), Ok(("", MatchOp::Neq)));
/// assert_eq!(match_op("=~"), Ok(("", MatchOp::EqlRe)));
/// assert_eq!(match_op("!~"), Ok(("", MatchOp::NeqRe)));
/// # }
/// ```
pub fn match_op(input: &str) -> IResult<&str, MatchOp> {
    alt((
        value(MatchOp::Eql, tag("==")),
        value(MatchOp::Neq, tag("!=")),
        value(MatchOp::EqlRe, tag("=~")),
        value(MatchOp::NeqRe, tag("!~")),
    ))(input)
}

Note that the match_op() parser also uses a custom enum for the result type.

One more beast, bear with me:

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct LabelMatch<'a> {
    pub label: &'a str,
    pub value: &'a str,
    pub op: MatchOp,
}

/// Parses a label matching operation.
///
/// ```
/// # use nom_parser_example::naive::{label_match, LabelMatch, MatchOp};
/// #
/// # fn main() {
/// assert_eq!(
///     label_match(r#"foo=="bar""#),
///     Ok(("", LabelMatch { label: "foo", value: "bar", op: MatchOp::Eql }))
/// );
/// assert_eq!(
///     label_match(r#"foo!~"2..""#),
///     Ok(("", LabelMatch { label: "foo", value: "2..", op: MatchOp::NeqRe }))
/// );
/// # }
/// ```
pub fn label_match(input: &str) -> IResult<&str, LabelMatch> {
    let (rest, label) = identifier(input)?;
    let (rest, op) = match_op(rest)?;
    let (rest, value) = string_literal(rest)?;
    Ok((rest, LabelMatch { label, value, op }))
}

Last but not least, putting all the above parsers together to match a vector selector expression:

#[derive(Clone, Debug, PartialEq)]
pub struct VectorSelector<'a> {
    pub metric: &'a str,
    pub labels: Vec<LabelMatch<'a>>,
}

/// Parses a vector selector.
///
/// ```
/// # use nom_parser_example::naive::vector_selector;
/// #
/// # fn main() {
/// assert_eq!(vector_selector(r#"foo{bar=="baz",qux!="123"}"#).is_ok(), true);
/// # }
/// ```
pub fn vector_selector(input: &str) -> IResult<&str, VectorSelector> {
    let (rest, metric) = identifier(input)?;
    let (rest, _) = char('{')(rest)?;
    let (rest, labels) = separated_list0(char(','), label_match)(rest)?;
    let (rest, _) = char('}')(rest)?;
    Ok((rest, VectorSelector { metric, labels }))
}

Lovely! Every individual parser looks nice and readable, and the compound vector_selector parser also seems pretty concise. So, where is the pitfall?

Try running the following code:

println!("{:?}", vector_selector(r#"foo {bar=="baz",qux="123",abc="mux"}"#));

Yes, I intentionally skipped whitespace handling here. But when I was writing the real parser, I genuinely forgot about it 🙈 The above snippet produces the following error:

Err(Error(Error { input: " {bar==\"baz\",qux=\"123\",abc=\"mux\"}", code: Char }))

Well, for me, it was super hard to decipher. The message contains part of the input string and an obscure code: Char part. I'd guess that Char comes from the char() parser, but we used it in several places - so which one is failing? Parse errors are supposed to be user-friendly, since it's the end user who will face them most of the time. However, this error is not even developer-friendly...
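
As for the whitespace bug itself, the standard cure is to make token-level parsers tolerate surrounding whitespace - with nom, typically by wrapping the inner parser in delimited(multispace0, inner, multispace0). A hand-rolled, library-free sketch of the same idea:

```rust
/// Matches a single expected character (similar in spirit to nom's `char`).
fn char_p(c: char, input: &str) -> Option<(&str, char)> {
    let rest = input.strip_prefix(c)?;
    Some((rest, c))
}

/// The same parser, but skipping any leading whitespace first.
fn ws_char(c: char, input: &str) -> Option<(&str, char)> {
    char_p(c, input.trim_start())
}

fn main() {
    // `foo {...}` trips up the bare parser on the space...
    assert_eq!(char_p('{', " {rest"), None);
    // ...but succeeds once whitespace is skipped.
    assert_eq!(ws_char('{', " {rest"), Some(("rest", '{')));
}
```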

Error position with nom_locate

Having the exact error position in the message would definitely simplify parser development. But apparently, nom doesn't track it. Luckily, there is a tiny crate called nom_locate that does the trick.

The input type of nom's parsers and combinators is actually generic. By default, it's &str, and that works well for demo purposes, but I quickly found it too limiting. nom_locate introduces a LocatedSpan type that can be used as the input type instead. Essentially, it's a thin wrapper that augments the original input type with location information.

The required code adjustments are actually quite small. First, add some type aliases for brevity:

use nom_locate::LocatedSpan;

pub type Span<'a> = LocatedSpan<&'a str>;

pub type IResult<'a, O> = nom::IResult<Span<'a>, O>;

And then simply adjust the input and output types of the parsers. Example:

fn identifier(input: Span) -> IResult<&str> { /* ... */ }

fn string_literal(input: Span) -> IResult<&str> { /* ... */ }

fn match_op(input: Span) -> IResult<MatchOp> { /* ... */ }

fn label_match(input: Span) -> IResult<LabelMatch> { /* ... */ }

fn vector_selector(input: Span) -> IResult<VectorSelector> { /* ... */ }

LocatedSpan values can be deref-ed back to the enclosed type. So wherever the conversion from LocatedSpan back to &str is needed, a simple *value does the job.

And, the moment of truth:

println!("{:#?}", vector_selector(Span::new(r#"foo {bar=="baz",qux!="123"}"#)));

// Outputs
// Err(
//     Error(
//         Error {
//             input: LocatedSpan {
//                 offset: 3,
//                 line: 1,
//                 fragment: " {bar==\"baz\",qux!=\"123\"}",
//                 extra: (),
//             },
//             code: Char,
//         },
//     ),
// )

Yay! 🎉 The exact position where the parser failed is in the error message now!

User-friendly message with custom error type

Much like the input type, the error type is also generic. The above message allows locating the error position quickly. However, it's still by no means a user-friendly error. Without looking at the code, it's not clear which of the parsers actually failed. There is also no room for suggesting input corrections.

Here is how a custom nom error can be added:

use nom;

#[derive(Debug, PartialEq)]
pub struct ParseError<'a> {
    span: Span<'a>,
    message: Option<String>,
}

impl<'a> ParseError<'a> {
    pub fn new(message: String, span: Span<'a>) -> Self {
        Self { span, message: Some(message) }
    }

    pub fn span(&self) -> &Span { &self.span }

    pub fn line(&self) -> u32 { self.span().location_line() }

    pub fn offset(&self) -> usize { self.span().location_offset() }
}

// That's what makes it nom-compatible.
impl<'a> nom::error::ParseError<Span<'a>> for ParseError<'a> {
    fn from_error_kind(input: Span<'a>, kind: nom::error::ErrorKind) -> Self {
        Self::new(format!("parse error {:?}", kind), input)
    }

    fn append(_input: Span<'a>, _kind: nom::error::ErrorKind, other: Self) -> Self {
        other
    }

    fn from_char(input: Span<'a>, c: char) -> Self {
        Self::new(format!("unexpected character '{}'", c), input)
    }
}

The result type can also be slightly adjusted:

pub type IResult<'a, O> = nom::IResult<Span<'a>, O, ParseError<'a>>;

So, now it's possible to have a code like that:

pub fn vector_selector(input: Span) -> IResult<VectorSelector> {
    let (rest, metric) = identifier(input).map_err(|_| {
        nom::Err::Error(ParseError::new(
            "vector_selector must start from a metric name".to_owned(),
            input,
        ))
    })?;
    let (rest, _) = char('{')(rest)?;
    let (rest, labels) = separated_list0(char(','), label_match)(rest)?;
    let (rest, _) = char('}')(rest)?;
    Ok((rest, VectorSelector { metric, labels }))
}
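
With line() and offset() available on the custom error, turning it into a human-friendly message is mostly string formatting. A minimal self-contained sketch, using a simplified error shape (just an offset and a message) rather than the exact ParseError above:

```rust
/// Points a caret at the failing offset of a (single-line) query.
fn render_error(query: &str, offset: usize, message: &str) -> String {
    format!("{}\n{}^ {}", query, " ".repeat(offset), message)
}

fn main() {
    let msg = render_error(
        r#"foo {bar=="baz"}"#,
        3, // offset reported by the parser
        "unexpected character ' '",
    );
    println!("{}", msg);
}
```

A multi-line query would additionally need the line number to pick the right line before rendering the caret, which is exactly what location_line() provides.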

Partial parsing error

nom's parsing approach resembles depth-first search. The combination of parsers forms a search tree that is greedily traversed from the root node, trying to find the longest possible match in the input string. If a parser in the current branch returns an error, nom backtracks, trying an alternate parser at every step back. While backtracking, nom simply discards the error messages of the failed parsers. If all the parsers fail, only the error message produced by the top-most parser is returned to the user.

For instance, if we are parsing a string like http_requests_total{foo=="bar", qux="123"}, the error will be about the malformed vector selector as a whole. Although as human beings, we can pinpoint the problem much more precisely - one of the label matches has an incomplete match operator =.

Maybe the label_match() parser even detected it and returned a proper error message. However, by default, nom would just discard it while backtracking. Luckily, there is a way to improve the parser and make it properly report partial results.

nom's error type is actually compound:

/// The `Err` enum indicates the parser was not successful
///
/// It has three cases:
///
/// * `Incomplete` indicates that more data is needed to decide. The `Needed` enum
/// can contain how many additional bytes are necessary. If you are sure your parser
/// is working on full data, you can wrap your parser with the `complete` combinator
/// to transform that case in `Error`
/// * `Error` means some parser did not succeed, but another one might (as an example,
/// when testing different branches of an `alt` combinator)
/// * `Failure` indicates an unrecoverable error. As an example, if you recognize a prefix
/// to decide on the next parser to apply, and that parser fails, you know there's no need
/// to try other parsers, you were already in the right branch, so the data is invalid
///
#[derive(Debug, Clone, PartialEq)]
#[allow(missing_doc_code_examples)]
pub enum Err<E> {
  /// There was not enough data
  Incomplete(Needed),
  /// The parser had an error (recoverable)
  Error(E),
  /// The parser had an unrecoverable error: we got to the right
  /// branch and we know other branches won't work, so backtrack
  /// as fast as possible
  Failure(E),
}

By default, parsers return nom::Err::Error(<actual error>). However, it's possible to use nom::Err::Failure(<actual error>) to make nom stop trying alternatives and propagate the failure right up to the root. Example:

pub fn label_match(input: Span) -> IResult<LabelMatch> {
    // If it doesn't start from an identifier, it's simply not a label matching.
    let (rest, label) = identifier(input)?;

    // But here it's already a partial match error. We are quite certain that
    // it's a label matching, but probably it's malformed or incomplete.
    let (rest, op) = match match_op(rest) {
        Ok((rest, op)) => (rest, op),
        Err(nom::Err::Error(_)) => {
            return Err(nom::Err::Failure(ParseError::new(
                "label matching - expected on of '==', '!=', '=~', '!~'".to_owned(),
                rest,
            )))
        }
        Err(e) => return Err(e),
    };

    // Same as above, but we are even more certain.
    let (rest, value) = match string_literal(rest) {
        Ok((rest, value)) => (rest, value),
        Err(nom::Err::Error(_)) => {
            return Err(nom::Err::Failure(ParseError::new(
                "label matching - expected string literal".to_owned(),
                rest,
            )))
        }
        Err(e) => return Err(e),
    };

    Ok((rest, LabelMatch { label, value, op }))
}

The upper-level parsers need to be aware of this trick as well. I find the following error handling pattern quite handy:

    // ...somewhere in the middle of a parser

    let (rest, something) = match subparser(rest) {
        Ok((rest, x)) => (rest, x),
        Err(nom::Err::Error(_)) => {
            // Complain about partial match.
        }
        // That's highly likely a lower-level Failure.
        // Just propagate it.
        Err(e) => return Err(e),
    };

Losing this Err(e) => return Err(e) match arm periodically cost me countless hours of debugging 🤭

Instead of conclusion

That's probably already too many code snippets for a single article. But I hope they cover the parts most needed to kickstart parser development with nom.

While working on pq, I faced another interesting parsing problem. PromQL's binary operation grammar seems to be left-recursive. For instance, an addition is defined as <expr> := <expr> + <expr>, where the left- and right-hand sides can in turn be additions as well. Thus, to define an expression, one needs to define an expression... Implementing such a grammar in nom naively immediately leads to infinite recursion that, of course, results in a stack overflow 🤯 But that's a totally different story...
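
For the curious, the usual escape hatch is to rewrite the left-recursive rule as iteration: parse one operand first, then fold any number of trailing "+ operand" pairs onto it (nom's fold_many0 expresses the same idea). A hand-rolled, library-free sketch that evaluates instead of building a syntax tree:

```rust
/// Parses a decimal number prefix: "42 rest" -> (" rest", 42).
fn number(input: &str) -> Option<(&str, i64)> {
    let end = input.find(|c: char| !c.is_ascii_digit()).unwrap_or(input.len());
    if end == 0 {
        return None; // no digits at the start
    }
    Some((&input[end..], input[..end].parse().ok()?))
}

/// <expr> := <expr> '+' <number>, rewritten without left recursion:
/// one operand, then a loop folding trailing '+' operands left-to-right.
fn addition(input: &str) -> Option<(&str, i64)> {
    let (mut rest, mut acc) = number(input)?;
    while let Some(tail) = rest.strip_prefix('+') {
        let (next, n) = number(tail)?;
        acc += n;
        rest = next;
    }
    Some((rest, acc))
}

fn main() {
    assert_eq!(addition("1+2+3"), Some(("", 6)));
    assert_eq!(addition("42 rest"), Some((" rest", 42)));
    assert!(addition("+1").is_none());
}
```

The same loop-instead-of-recursion shape works when producing a syntax tree - the accumulator just becomes a growing Expr node rather than a number.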

P.S. All the code snippets from the article can be found here.