Peppa PEG Specification¶
Objectives¶
Peppa PEG aims to be an opinionated PEG that is easy for parsing the source code into an AST.
Peppa PEG is designed to describe a formal language by extending the original PEG formalism with some user-friendly symbols and notations, such as repetition and back reference in regular expression, cut in Python PEG Grammar , left recursion in EBNF, modifiers in Pest, etc.
Spec¶
Peppa PEG is case-sensitive.
A Peppa PEG grammar must be a valid UTF-8 encoded Unicode document.
Comment¶
A hash symbol following by any characters til end of line marks a comment.
comment = "#" (!"\n" .)* "\n"?;
Comments are ignored and can be marked between rules. This is a simple way of including additional information in parallel with the specifications.
For example,
# This is a full-line comment.
rule = "# This is not a comment."; # This is a comment at the end of a line.
Rule Naming¶
The name of a rule is a sequence of characters, beginning with an alphabet or underscore, and followed by a combination of alphabets, digits, or underscores (_).
name = ([a-z]/[A-Z]/"_") ([a-z]/[A-Z]/[0-9]/"_")*;
Rule names are case sensitive.
For example, the rule names listed below are valid:
rule
DIGITS
oct_int
while these are invalid:
0to9
ml-string
Rule Form¶
The primary building block of a Peppa PEG document is the rule.
A rule is defined by the following sequence:
rule = decorator* name "=" expression ";"
where one or more rule decorators may be prefixed before rule names, rule names are on the left-hand side of the equals sign and expressions are one the right-hand side. Rules always ends up with a semicolon (;). The key, equal sign, expressions, and semicolon can be broken over multiple lines. Whitespace is ignored around rule names and between expressions.
For example:
rule = rule2 / "expr4" / "expr5" / "expr6";
@lifted
rule2 = "expr1"
/ "expr2"
/ "expr3"
;
Unspecified expressions are invalid.
rule = ; # INVALID!
Rule Expressions¶
Rule expressions have the following precedence from highest to lowest:
Primary
Repeat
Sequence
Choice
Left Recursion
Since left recursion has the lowest precedence, a rule expression always starts interpreting as a left recursion.
expression = left_recursion;
Primary rule expressions must have one of the following types.
Literal
Insensitive
Range
Reference
Back Reference
Positive
Negative
Dot
Cut
Grouping
primary = literal
/ insensitive
/ range
/ (reference !"=")
/ back_reference
/ positive
/ negative
/ dot
/ cut
/ "(" choice ")"
;
Grouping notion () is strongly advised which will avoid misinterpretation by casual readers. For example,
foobar = (foo / bar) / (goo / par);
Dot¶
Single dot . can match any Unicode character. It’s a syntax sugar for [\u0001-\U0010ffff].
any = .;
Literal¶
The literal matches an exact same string surrounded by double quotes.
For example,
greeting = "hello world";
Unicode is supported:
greeting = "你好,世界";
Emoji can be encoded via Unicode so it is supported:
greeting = "Peppa 🐷";
You can encode ASCII characters via \x followed by 2 hex digits.
greeting = "\x48\x65\x6c\x6c\x6f, world";
You can encode Unicode characters via \u followed by 4 hex digits or \U followed by 8 hex digits. The escape codes must be valid Unicode scalar values.
greeting = "\u4f60\u597D, world\U0000000c";
Range¶
Range matches a single character in range. Ranges are enclosed by [ and ]. The full form is:
range = "[" (
(char "-" char (".." number)?)
/ ("\p{" range_category "}")
) "]";
In the following example, any character between ‘0’ to ‘9’ can match.
digits = [0-9];
The lower and upper character of the range can be not only ASCII characters but also UTF-8 code points. The syntax can be \xXX, \uXXXX or \uXXXXXXXX.
digits = [\u4e00-\u9fff];
The value of lower must be less or equal than the upper.
any = [\U0010ffff-\u0001]; # INVALID!!!
Range supports an optional stride to skip certain amount of characters in the range. In this example, only odd number between ‘0’ to ‘9’ can match.
digits = [0-9..2];
Range can support Unicode general categories and properties by wrapping filters with \p{}, such as C, Cc, Cf, Co, Cs, Ll, Lm, Lo, Lt, Lu, L, Nd, Nl, No, N, Id_Start, Id_Continue, Other_Id_Start, Other_Id_Continue, White space, etc.
range_category = ([a-z] / [A-Z] / [0-9] / "_" / " ")+;
For example,
unicode_letter = [\p{L}]; # Any Unicode Letter (Ll, Lm, Lo, Lt, Lu).
unicode_digit = [\p{Nd}]; # Any Number, decimal digit (Nd).
Whether range category is supported depends on the implementations.
Sequence¶
Sequence matches a sequence of rules in order, e.g. a concatenation of contiguous characters.
The full form is:
sequence = repeat+;
The first sequence element is attempted. If succeeds, the second is attempted, so on and on. If any one of the attempts fails, the match fails.
For example:
greeter = "Hello" " " "world";
Elements enclosed in parentheses are considered as a single element. Thus,
rule = prefix (foo / bar) postfix;
matches either (prefix foo postfix) or (prefix bar postfix), and
rule = prefix foo / bar postfix;
matches either (prefix foo) or (bar postfix).
Choice¶
Choice matches one of the alternatives.
Elements in the choice separated by a forward slash (/) are alternatives. The full form of choice is:
choice = sequence ("/" sequence)*;
The first alternative is attempted. If fails, the second is attempted, so on and on. If any one of the attempts succeeds, the match succeeds. If all attempts fail, the match fails.
For example:
greeter = "Hello World" / "你好,世界" / "Kia Ora";
Left Recursion¶
A left recursion starts with a lhs expression, followed by a | symbol, the same reference to the rule name, and a rhs expression. The full form of left recursion is:
left_recursion = lhs ("|" reference rhs)?;
lhs = choice;
rhs = choice;
Left recursion expression matches left-hand side sub-expression first, then matches right-hand side sub-expression repeatly. Whenever a right-hand side expression is matched, together with what has matched by the left-hand side expression, a new node is left recursively added to the tree.
For example, consider the following minimal arithmetic grammar rule,
S = E | S add_op E;
add_op = "+";
E = [0-9];
Rule S is similar to @nonterminal S = F (add_op F)*;. Unlike the repetition form generating flat hierarchy, given input 1+2+3, it produces such a tree:
{
"slice": [0, 5],
"type": "S",
"children": [
{
"slice": [0, 3],
"type": "S",
"children": [
{"slice": [0, 1], "type": "E"},
{"slice": [1, 2], "type": "op"},
{"slice": [2, 3], "type": "E"}
]
},
{"slice": [3, 4], "type": "op"},
{"slice": [4, 5], "type": "E"}
]
}
Peppa PEG supports a limited form of left recursion in the sense that,
Neither lhs nor rhs expression can have left recursion. Otherwise, the implementation may run out of stack space, resulting in stack overflow.
Indirect left recursion is not allowed, e.g, the reference followed by symbol | must be identical to the rule name.
Left recursion rule cannot be wrapped into a group by using parentheses.
Full adoption of left recursion may bring ambiguity into the result parsing tree, which violates one of the PEG’s fundamental characteristics. Nonetheless, left recursion is widely used and encouraged form in some CFG parsers, such as EBNF. By making this practical decision, users can use PEG with left recursion for a certain of use cases.
Right Recursion¶
PEG supports right recursion in nature by using sequence, choice and reference to the rule itself.
For example:
pow = num "^" pow / num;
num = [1-9];
given input 1^2^3, the parser should generate such a tree:
{
"slice": [0, 5],
"type": "pow",
"children": [
{"slice": [0, 1], "type": "num"},
{
"slice": [2, 5],
"type": "pow",
"children": [
{"slice": [2, 3], "type": "num"},
{"slice": [4, 5], "type": "num"}
]
},
]
}
e.g. (1^(2^3)).
Reference¶
Reference matches a string based on the referenced grammar rule. A reference has the same specification of rule name:
name = ([a-z] / [A-Z] / "_") ([a-z] / [A-Z] / [0-9] / "_")*;
For example, greeter is just a reference rule in greeting. When matching greeting, it will use the referenced grammar rule greeter first, e.g. “Hello” / “你好”, then match ” world”.
greeting = greeter " world";
greeter = "Hello" / "你好";
The order of defining a rule does not matter.
greeter = "Hello" / "你好";
greeting = greeter " world";
One should ensure all references must have corresponding rule defined, otherwise, the parse fails due to an undefined rule.
Back Reference¶
Back reference matches an exact same string as previously matched in the sequence.
Back reference starts with a back slash, followed by a number. The number is zero-based and cannot be a number greater than or equal to the index of itself. The number indicates which previous member in the sequence should be back referenced. The full form is:
back_reference = "\" number;
In the following example, \0 matches whatever quote has matched, thus “a” or ‘a’ are valid. But “a’ or ‘a” are invalid.
str = quote [a-z] \0;
quote = "\"" / "'";
Back reference is applicable only to the named rule. Thus, the following rule can match “abaa” but not “abba”.
rule = "a" ("b" \0) \0;
Insensitive¶
Insensitive operator starts with “i” and followed by a literal or back reference.
insensitive = "i" (literal / back_reference);
For example,
Given the following rule, back reference \0 is case-insensitive. Hence, both a=A and a=a are valid.
rule = [a-z] "=" i\0;
Given the following rule, literal “hello world” is case-insensitive. Hence, both ì and Ì are valid.
rule = i"ì";
Positive¶
Positive tests if a primary expression can match. Positive starts with & and followed by a primary:
positive = "&" primary;
Positive attempts to match the primary expression. If succeeds, the test passes. Positive does not “consume” any text.
Positive can be useful in limiting the possibilities of the latter member in a Sequence. In this example, the Sequence expression must start with “Hello”. Given “Hello World”, “Hello WORLD”, “Hello world” it will match. Given “HELLO WORLD” it will not match.
greeting = &"Hello" i"hello world";
Negative¶
Negative tests if a primary expression failed to match. Negative starts with ! and followed by a primary:
negative = "!" primary;
Negative expects the primary expression doesn’t match. If fails, the test passes. Negative does not “consume” any text.
In this example, the greeting rule must not start with “Hello”. Given “HELLO World”, “hello WORLD”, “hello world”, it will match. Given “Hello World”, it will not match.
greeting = !"Hello" i"hello world";
Repetition¶
Operators +, *, ? and {} followed by an expression indicates repetition.
The full form of repetition is:
repeat = primary (onceormore / zeroormore / zerooronce / repeatexact / repeatminmax / repeatmin / repeatmax)?;
Plus (+) matches string one or more times.
onceormore = "+";
For example,
number = [0-9]+;
Asterisk (*) matches string zero or more times.
zeroormore = "*";
For example,
number = [0-9] [1-9]*;
Question (?) matches string one or more times.
zerooronce = "?";
For example,
number = [0-9] "."?;
{cnt} matches exactly cnt occurrences of an expression, where cnt is a decimal value.
repeatexact = "{" number "}";
For example,
unicode = "\U" ([0-9] / [a-f] / [A-F]){8};
{min,max} matches an expression of at least min occurrences and at most max occurrences, where min and max are decimal values.
repeatminmax = "{" number "," number "}";
For example,
hex = "\u{" ([0-9] / [a-f] / [A-F]){1,6} "}";
{min,} matches an expression of at least min occurrences, where min is a decimal value.
repeatmin = "{" number "," "}";
For example,
above_hundred = [1-9] [1-9]{2,};
{,max} matches an expression of at most max occurrences, where max is a decimal value.
repeatmax = "{" "," number "}";
For example,
below_thousand = [0-9]{,3};
Cut¶
Cut is a decorator written as “~”. It always succeeds, but cannot be backtracked.
cut = "~";
It’s used to prevent unwanted backtracking, e.g. to prevent excessive choice options.
Backtracking means if e1 in rule = e1 / e2; fails, the parser returns the last position where e1 started, and tries e2. If there is a ~ in e1, any failure after the cutting point will cause rule failed immediately. Cut ensures the parse sticks to the current rule, even if it fails to parse. See ideas 1, 2.
For example, let’s first consider the following grammar,
value = array / null;
array = "[" "]";
null = "null";
Given input “[”, it attempts matching array first. After failed, it will try null next. At last, value match is failed.
Let’s add a cut operator:
value = array / null;
array = "[" ~ "]";
null = "null";
Given input “[”, it attempts matching array first. After failed, value match is failed immediately.
Given input “null”, it attempts matching array first. It fails before ~ and then failed matching array. Parser then match “null” successfully.
Decorators¶
Decorators are characters @ followed by some selected keywords. Valid decorators include: @spaced, @squashed, @scoped, @tight, @lifted and @nonterminal.
decorator = "@" ("squashed" / "scoped" / "spaced" / "lifted" / "nonterminal");
For example,
@spaced @lifted
ws = " " / "\t" / "\n";
@spaced¶
A @spaced rule will be auto-inserted in elements of sequences and repetitions.
For example, my_sequence can match “helloworld”, “hello world”, “hello t n world”, etc.
my_sequence = "hello" "world";
@spaced
ws = " " / "\t" / "\n";
Rule my_sequence is in essence equivalent to:
my_sequence = "hello" (" " / "\t" / "\n")* "world";
There may be multiple @spaced rules in the grammar. Thus,
my_sequence2 = "hello" "world";
@spaced
ws = " " / "\t" / "\n";
@spaced
dot = ".";
Rule my_sequence is equivalent to:
my_sequence2 = "hello" ((" " / "\t" / "\n") / ".")* "world";
The @spaced rules only take effects for rules having no @tight decorators.
@tight¶
A @tight rule deters any @spaced rule from auto-inserted.
For example, my_sequence can only match “helloworld”.
@tight
my_sequence = "hello" "world";
@spaced
ws = " " / "\t" / "\n";
@lifted¶
A @lifted rule replaces the node with its children in the parse tree, if exists any children.
For example,
rule = lit;
@lifted
lit = digit / char;
number = [0-9]+;
char = ([a-z] / [A-Z])+;
given input “42”, the parsed tree looks like:
{
"type": "rule",
"children": [
{"type": "number", "text": "4"},
{"type": "number", "text": "2"}
]
}
This decorator is useful for trimming some unnecessary nodes in the parse tree.
@nonterminal¶
A @nonterminal rule replaces the node with its children in the parse tree, if exists one child. If the rule produces multiple children, this decorator has no effect.
For example,
@nonterminal
add = number ("+" number)?;
number = [0-9];
given input “1”, the tree is like:
{"type": "add", "text": "1"}
given input “1+2”, the tree is like:
{
"type": "add",
"children": [
{"type": "number", "text": "1"},
{"type": "number", "text": "2"},
]
}
@squashed¶
A @squashed rule drops all node children in the parse tree.
For example,
@squashed
float = number ("." number)?;
number = [0-9];
given input “1.0”, rule float drops all number nodes, leaving only one single node in the ast:
{"type": "float", "text": "1.0"}
@scoped¶
A @scoped rule reset all decorators that inherit from upstream rules.
For example, despite greeting2 set to not using spaced rule ws, greeting can still apply to ws since it’s under its own scope.
@tight
greeting2 = greeting greeting;
@scoped
greeting = "hello" "world";
@spaced
ws = " ";
Cheatsheet¶
Syntax |
Meaning |
---|---|
foo = …; |
grammar rule |
@lifted foo = …; |
drop node |
@spaced foo = …; |
mark as space |
@squashed foo = …; |
ignore children |
@tight foo = …; |
ignore spaced rules |
@non_terminal foo = …; |
ignore single child node |
@scoped foo = …; |
cancle effects |
“literal” |
exact match |
“x0dx0a” |
exact match by using ascii digits |
“u4f60u597D” |
exact match utf-8 characters |
i”literal” |
case-insensitive match |
[a-z] |
range |
[0-9..2] |
range with stride |
[\u0001-\U0010ffff] |
range using unicode runes |
[\p{L}] |
range using unicode categories |
. |
any character |
foo bar |
sequence |
foo / bar |
choice |
\0 |
back reference |
&foo |
positive |
!foo |
negative |
~ |
prevent unwanted backtracking |
foo* |
zero or more |
foo+ |
once or more |
foo? |
optional |
foo{m,} |
repeat at least m times |
foo{,n} |
repeat at most n times |
foo{m,n} |
repeat between m-n times |
foo{m} |
repeat exact n times |
# IGNORE |
comment |
Appendix. Peppa PEG of Peppa PEG¶
This appendix includes the full form of Peppa PEG specification.
grammar = start_of_input rule+ end_of_input;
start_of_input = &.;
end_of_input = !.;
rule = decorators name ~ "=" expression ";";
decorators = decorator*;
decorator = "@" ~ (
"squashed"
/ "scoped"
/ "spaced"
/ "lifted"
/ "tight"
/ "nonterminal"
);
@squashed
name = reference;
@lifted
expression = left_recursion;
@nonterminal
left_recursion = choice ("|" ~ reference choice)?;
@nonterminal
choice = sequence ("/" sequence)*;
@nonterminal
sequence = repeat+;
@nonterminal
repeat = primary (
onceormore
/ zeroormore
/ zerooronce
/ repeatexact
/ repeatminmax
/ repeatmin
/ repeatmax
)?;
onceormore = "+";
zeroormore = "*";
zerooronce = "?";
repeatexact = "{" number "}";
repeatminmax = "{" number "," number "}";
repeatmin = "{" number "," "}";
repeatmax = "{" "," number "}";
@lifted
primary = literal
/ insensitive
/ range
/ (reference !"=")
/ back_reference
/ positive
/ negative
/ "(" choice ")"
/ dot
/ cut;
@squashed @tight
literal = "\"" ~ chars "\"";
chars = char*;
@squashed @tight
char = [\x20-\x21] / [\x23-\x5b] / [\x5d-\U0010ffff] / (
"\\" (
"\""
/ "/"
/ "\\"
/ "b"
/ "f"
/ "n"
/ "r"
/ "t"
/ ("x" ~ two_hexdigits)
/ ("u" ~ four_hexdigits)
/ ("U" ~ eight_hexdigits)
)
);
@squashed @tight
range_category = ([a-z] / [A-Z] / [0-9] / "_" / " ")+;
range = "[" (
(char "-" char (".." number)?)
/ ("\\p{" range_category "}")
) "]";
@tight
insensitive = "i" (literal / back_reference);
@squashed @tight
reference = ([a-z] / [A-Z] / "_") ([a-z] / [A-Z] / [0-9] / "_")*;
@tight
back_reference = "\\" ~ number;
positive = "&" ~ primary;
negative = "!" ~ primary;
dot = ".";
cut = "~";
hexdigit = [0-9] / [a-f] / [A-F];
two_hexdigits = hexdigit{2};
four_hexdigits = hexdigit{4};
eight_hexdigits = hexdigit{8};
@squashed @tight
number = "0" / [1-9] [0-9]*;
@lifted @spaced
comment = "#" (!"\n" .)* "\n"?;
@lifted @spaced
whitespace = " " / "\t" / "\r" / "\n";