Nyu Tutorial

This tutorial assumes familiarity with parsing expression grammars (PEGs).

Creating a grammar

Grammars must begin with a “@grammar” statement, which determines the namespace the output grammar lives in.

@grammar MyGrammar

Grammars can be nested within other namespaces in the output code by preceding “@grammar” with a “@module” statement.

@module grandparent_namespace.parent_namespace
@grammar MyGrammar

Comments are C++-style.

// a single line comment

/* a comment that ends at
   a star followed by a slash */

Unlike in standard PEGs, whitespace is optional between elements in a sequence. The “Spacing” rule determines what is considered whitespace (and thus should contain no sequences itself). The symbol ^ is placed between elements of a sequence that must not have whitespace matched between them.

@grammar javascript
Spacing        <- \s+  // \s matches tabs, vertical tabs and spaces.
Identifier     <- [a-zA-Z_] ^ [a-zA-Z0-9_]*
VarDeclaration <- "var" Identifier

TODO: More Sections Go Here

Using % to parse joined sequences

@grammar list
Spacing    <- \s

// A * or + repetition of a character-storing parser stores a string.
IdSuffix <- [0-9a-zA-Z_]*

// Spacing* is allowed between parsers in a sequence. Spacing* is not
// allowed between parsers joined with ^.
// The storage types of adjacent string/char storing parsers joined with ^
// are collapsed to store a single string, so Id stores a string.
Id <- [a-zA-Z_] ^ IdSuffix

// (P % Q) matches many P joined with Q (with optional Spacing* around
// the Q) and stores vector<storage type of P>.
// In Grammar, "[", "," and "]" are not stored because they always match
// the same data. (P ^% Q) is the same as (P % Q) but does not allow
// Spacing* to match around Q.
Grammar    <- "[" (Id % ",") "]"

// Grammar stores: vector<string>
//         parses: [ hello, king ]
//             as: vector<string>("hello", "king")
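
To make the matching rules concrete, here is a hand-written C++ sketch of what the list grammar accepts and stores. It is not Nyu's generated code; every function name (skip_spacing, parse_id, parse_literal, parse_list) is hypothetical, and it assumes that % also accepts an empty list, which the tutorial does not state.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical hand-written equivalent of the "list" grammar above.
// None of these names come from Nyu.

// Spacing*: skip any run of whitespace.
static void skip_spacing(const std::string& in, std::size_t& pos) {
    while (pos < in.size() && std::isspace(static_cast<unsigned char>(in[pos]))) ++pos;
}

// Id <- [a-zA-Z_] ^ IdSuffix: no spacing is skipped inside an Id.
static std::optional<std::string> parse_id(const std::string& in, std::size_t& pos) {
    auto is_start = [](unsigned char c) { return std::isalpha(c) || c == '_'; };
    auto is_rest  = [](unsigned char c) { return std::isalnum(c) || c == '_'; };
    if (pos >= in.size() || !is_start(in[pos])) return std::nullopt;
    std::size_t begin = pos++;
    while (pos < in.size() && is_rest(in[pos])) ++pos;
    return in.substr(begin, pos - begin);
}

// A one-character literal with Spacing* allowed before it.
static bool parse_literal(const std::string& in, std::size_t& pos, char lit) {
    std::size_t p = pos;
    skip_spacing(in, p);
    if (p < in.size() && in[p] == lit) { pos = p + 1; return true; }
    return false;
}

// Grammar <- "[" (Id % ",") "]"
// Only the Ids are stored; the literals always match the same text.
std::optional<std::vector<std::string>> parse_list(const std::string& in) {
    std::size_t pos = 0;
    std::vector<std::string> ids;
    if (!parse_literal(in, pos, '[')) return std::nullopt;
    skip_spacing(in, pos);
    // Assumption: % may match zero items.
    if (auto id = parse_id(in, pos)) {
        ids.push_back(*id);
        for (;;) {
            std::size_t save = pos;
            if (!parse_literal(in, save, ',')) break;
            skip_spacing(in, save);
            auto next = parse_id(in, save);
            if (!next) break;  // a trailing "," is not consumed
            ids.push_back(*next);
            pos = save;
        }
    }
    if (!parse_literal(in, pos, ']')) return std::nullopt;
    skip_spacing(in, pos);
    if (pos != in.size()) return std::nullopt;
    return ids;
}
```

Calling parse_list("[ hello, king ]") yields a vector holding "hello" and "king", while "[ hel lo ]" is rejected because spacing is never skipped inside an Id.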

Using / to parse ordered choice and store variant

@grammar js
Spacing     <- \s
Id          <- [a-zA-Z_] ^ [0-9a-zA-Z_]*

// Sequences store tuples. Sub-tuples are flattened into the parent
// tuple type, and a tuple that stores a single type is collapsed to
// that type.
// FuncCall stores tuple<string, string>.
FuncCall <- Id "(" Id ")"

// FuncCall stores tuple<string, string> and Id stores string so
// Grammar stores vector< variant<string, tuple<string, string> > >
// Duplicate types are collapsed into a single entry in a variant, and a
// variant that stores a single type is collapsed to that type.
Grammar <- (FuncCall / Id)+
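
The stored type above maps naturally onto C++17's std::variant and std::tuple. The following sketch shows how calling code might consume such a value; the aliases (Id, FuncCall, Item, Grammar) and the describe function are illustrative names only, and the types Nyu actually emits may differ.

```cpp
#include <string>
#include <tuple>
#include <type_traits>
#include <variant>
#include <vector>

// Illustrative aliases only; the names Nyu generates may differ.
using Id       = std::string;
using FuncCall = std::tuple<std::string, std::string>;
using Item     = std::variant<Id, FuncCall>;
using Grammar  = std::vector<Item>;

// Renders one stored item back to text, dispatching on the alternative
// that the ordered choice (FuncCall / Id) selected.
std::string describe(const Item& item) {
    return std::visit([](const auto& v) -> std::string {
        using T = std::decay_t<decltype(v)>;
        if constexpr (std::is_same_v<T, Id>) {
            return "id:" + v;
        } else {
            return "call:" + std::get<0>(v) + "(" + std::get<1>(v) + ")";
        }
    }, item);
}
```

For an input such as print(x) y, the first element would hold the FuncCall alternative and the second the Id alternative, so describe would return "call:print(x)" and "id:y" respectively.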

Using <= to create node parsers which store new types

@grammar mathematics_basic
Spacing   <- \s

// <int- stores resulting parsed string as int
Number    < int - [0-9]+

// <= creates a "node parser". Node parsers store a new type with the same
// name as the parsing rule. The storage types of node parsers are not
// flattened into the storage type of including parsers.
Product   <= Number %+ "*"

// %+ is like % but at least one join item must be stored.
// P %+ Q matches P (Q P)* and stores vector<storage type of P>
Addition  <= Product %+ "+"

// Spacing* is allowed between elements in each P in P+ or P* unless
// P stores a character. P^+ is the same as P+ with no spacing allowed.
Grammar   <- Addition+

// These examples use [ .. ] to represent a stored list/vector type.
// stores: vector<Addition>
// creates:
//     class Product {
//          vector<int> value_;
//     }
//     class Addition {
//          vector<Product> value_;
//     }
// parses: 4 + 2
// as:     [ Addition[Product[4], Product[2]] ]
// parses: 4
// as:     [ Addition[Product[4]] ]
// parses: 4 7
// as:     [ Addition[Product[4]], Addition[Product[7]]]
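
Because node parsers keep their own types, code that walks the result can overload on them. The sketch below mirrors the class shapes listed above and adds an eval function per node type; the public members and the eval overloads are assumptions made for illustration, not part of Nyu's output.

```cpp
#include <vector>

// Mirrors the class shapes listed above. Members are public here only
// to keep the sketch short.
struct Product  { std::vector<int> value_; };
struct Addition { std::vector<Product> value_; };

// Multiplies every Number stored in a Product.
int eval(const Product& p) {
    int result = 1;
    for (int n : p.value_) result *= n;
    return result;
}

// Sums every Product stored in an Addition.
int eval(const Addition& a) {
    int sum = 0;
    for (const Product& p : a.value_) sum += eval(p);
    return sum;
}
```

With this shape, the stored value for 4 + 2, Addition[Product[4], Product[2]], evaluates to 6.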

Node parsers can refer to themselves

@grammar mathematics_basic
Spacing    <- \s
Number     < int - [0-9]+

// Expression recursively refers to itself through Term. This would not be
// possible if Expression were not a node parser, because its storage type
// would then recursively depend on itself.
// Term stores variant<int, Expression>
Term       <= Number / "(" Expression ")"
Product    <= Term %+ "*"
Addition   <= Product %+ "+"
Expression <= Addition
Grammar    <- Expression+
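
The same constraint shows up in C++: a type cannot contain itself by value, but a named class can be forward-declared and referred to indirectly. The sketch below assumes Product is built from Term (which is what lets Expression reach itself) and assumes indirection through std::unique_ptr; the tutorial does not say how Nyu's output breaks the cycle, and the helpers number, paren and eval are hypothetical.

```cpp
#include <memory>
#include <utility>
#include <variant>
#include <vector>

struct Expression;  // forward declaration breaks the type cycle

// Term stores variant<int, Expression>. Expression is still incomplete
// here, so this sketch holds it through a unique_ptr (an assumption).
struct Term       { std::variant<int, std::unique_ptr<Expression>> value_; };
struct Product    { std::vector<Term> value_; };
struct Addition   { std::vector<Product> value_; };
struct Expression { Addition value_; };

// Hypothetical helpers for building Terms by hand.
Term number(int n) { return Term{n}; }
Term paren(Expression e) {
    return Term{std::make_unique<Expression>(std::move(e))};
}

int eval(const Expression& e);

int eval(const Term& t) {
    if (const int* n = std::get_if<int>(&t.value_)) return *n;
    return eval(*std::get<std::unique_ptr<Expression>>(t.value_));
}
int eval(const Product& p) {
    int result = 1;
    for (const Term& t : p.value_) result *= eval(t);
    return result;
}
int eval(const Addition& a) {
    int sum = 0;
    for (const Product& p : a.value_) sum += eval(p);
    return sum;
}
int eval(const Expression& e) { return eval(e.value_); }
```

Because Term owns a unique_ptr, these types are move-only, so values are built with push_back and std::move rather than initializer lists.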

Using |% to recursively collapse unjoined nodes

@grammar mathematics
Spacing   <- \s
Number    < int - [0-9]+

// |% parses the same data as %+ but stores parsed data differently.
// The node type is only created if the join matches more than one item,
// otherwise it stores the item to the left of the |%. The resulting type
// of the whole expression is a variant that can store either type.
// In this case Product stores variant<int, Product> which is populated
// with either int or Product depending on whether Number matches one or
// many times.
Product   <= Number  |% "*"

// Node parsers that use |+, |* or |% cannot refer to themselves.

// Addition stores: variant<Addition, storage type of Product>
//          expand: variant<Addition, variant<Product, int>>
//        collapse: variant<Addition, Product, int>
Addition  <= Product |% "+"

Grammar   <- Addition+

// stores: vector< variant<int, Product, Addition> >
// creates:
//     class Product {
//          vector<int> value_;
//     }
//     class Addition {
//          vector<variant<int, Product>> value_;
//     }
// parses: 4 + 2
// as:     [ Addition[4, 2] ]
// parses: 4
// as:     [ 4 ]
// parses: 4 7
// as:     [ 4, 7 ]
// parses: 4 + 2 * 7
// as:     [ Addition[4, Product[2, 7]] ]
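
The collapsing rule can be sketched as two small C++ functions: each takes the items matched by a join and either passes a lone item through unchanged or wraps several items in the node type. The alias and function names (ProductOrInt, AdditionOrBelow, collapse_product, collapse_addition) are hypothetical and only illustrate the behaviour described above.

```cpp
#include <cassert>
#include <utility>
#include <variant>
#include <vector>

struct Product  { std::vector<int> value_; };
using ProductOrInt = std::variant<int, Product>;
struct Addition { std::vector<ProductOrInt> value_; };
using AdditionOrBelow = std::variant<int, Product, Addition>;

// Number |% "*": a single item is stored as-is, several become a Product.
ProductOrInt collapse_product(std::vector<int> items) {
    assert(!items.empty());  // the join always matches at least one item
    if (items.size() == 1) return items.front();
    return Product{std::move(items)};
}

// Product |% "+": a single item keeps its own type (int or Product),
// several become an Addition.
AdditionOrBelow collapse_addition(std::vector<ProductOrInt> items) {
    assert(!items.empty());
    if (items.size() == 1) {
        return std::visit(
            [](auto&& v) -> AdditionOrBelow { return std::move(v); },
            std::move(items.front()));
    }
    return Addition{std::move(items)};
}
```

For 4 + 2 * 7 the two products collapse to the int 4 and Product[2, 7], and the addition wraps both in an Addition; for a lone 4 every level passes the int straight through.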

Using # to hash data

@grammar hash

Id <- [a-zA-Z]+

// (KeyPair <- Id "=" Id) would store tuple<string, string>, but because
// the first Id is prefixed with "#", key_value<string, string> is stored
// instead.
KeyPair <- #Id "=" Id

// (P % Q) would normally store vector<storage type of P>, but when the
// storage type of P is key_value<...> then it stores a vector_hash_map.
// A vector_hash_map stores the order in which items were inserted
// in addition to a hash index which can be used for fast access to
// a stored item based on its key. This storage behaviour is the same
// for all parsers that can store vectors.
Grammar <- "{" KeyPair % "," "}"

// Using { key -> value, ... } to represent the hash map type Nyu creates,
// this parses "{ first = hello, second = bye }" as:
// {
//     "first" -> "hello",
//     "second" -> "bye"
// }
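
A container with the two properties described above (insertion order plus a hash index) can be sketched as a vector of pairs alongside an unordered_map from key to position. The class below is a hypothetical stand-in named OrderedHashMap; it is not Nyu's vector_hash_map, whose interface this tutorial does not show, and its handling of repeated keys is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the vector_hash_map described above.
class OrderedHashMap {
public:
    // Appends the pair and indexes its position by key.
    // Assumption: a repeated key keeps its first position and the new
    // value overwrites the old one.
    void insert(const std::string& key, const std::string& value) {
        auto found = index_.find(key);
        if (found != index_.end()) {
            items_[found->second].second = value;
            return;
        }
        index_.emplace(key, items_.size());
        items_.emplace_back(key, value);
    }

    // Average constant-time lookup through the hash index; nullptr if
    // the key is absent.
    const std::string* find(const std::string& key) const {
        auto found = index_.find(key);
        return found == index_.end() ? nullptr : &items_[found->second].second;
    }

    // Items in the order they were inserted.
    const std::vector<std::pair<std::string, std::string>>& items() const {
        return items_;
    }

private:
    std::vector<std::pair<std::string, std::string>> items_;
    std::unordered_map<std::string, std::size_t> index_;
};
```

Inserting "first" then "second" keeps that order when iterating items(), while find("second") goes through the hash index instead of scanning the vector.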