Low Level APIs

Use Peg API

Function P4_LoadGrammar() can load a grammar from a string.

P4_Grammar* grammar = P4_LoadGrammar(
    "add = int \"+\" int;"

    "@squashed @tight "
    "int = [0-9]+;"

    "@spaced @lifted "
    "ws  = \" \";"
);

This single statement is roughly equivalent to the following code written with the low-level C API:

P4_Grammar* grammar = P4_CreateGrammar();

if (P4_Ok != P4_AddSequenceWithMembers(grammar, RuleAdd, 3,
    P4_CreateReference(RuleInt),
    P4_CreateLiteral("+", true),
    P4_CreateReference(RuleInt)
))
    goto finalize;

if (P4_Ok != P4_AddOnceOrMore(grammar, RuleInt, P4_CreateRange('0', '9', 1)))
    goto finalize;
if (P4_Ok != P4_SetGrammarRuleFlag(grammar, RuleInt, P4_FLAG_SQUASHED|P4_FLAG_TIGHT))
    goto finalize;

if (P4_Ok != P4_AddLiteral(grammar, RuleWs, " ", true))
    goto finalize;
if (P4_Ok != P4_SetGrammarRuleFlag(grammar, RuleWs, P4_FLAG_SPACED|P4_FLAG_LIFTED))
    goto finalize;

Expressions

Literal

Literal matches an exact string.

P4_Expression* expr = P4_CreateLiteral("Hello world", true);

Literal is case-insensitive for ASCII characters when the sensitive option is false.

P4_Expression* expr = P4_CreateLiteral("HELLO WORLD", false);
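To make the ASCII-only rule concrete, here is a minimal standalone sketch of how such a comparison could work. It is not Peppa PEG's implementation; match_literal is a hypothetical helper that folds case only for bytes below 0x80 and compares all other bytes exactly.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not a Peppa PEG function): returns the number of
 * bytes matched at the start of `input`, or 0 if `literal` does not match. */
size_t match_literal(const char *input, const char *literal, bool sensitive)
{
    assert(input != NULL && literal != NULL && literal[0] != '\0');
    size_t i = 0;
    for (; literal[i] != '\0'; i++) {
        unsigned char a = (unsigned char)input[i];
        unsigned char b = (unsigned char)literal[i];
        /* Fold case for ASCII bytes only; UTF-8 multi-byte sequences
         * (all bytes >= 0x80) are compared exactly. */
        if (!sensitive && a < 0x80 && b < 0x80) {
            a = (unsigned char)tolower(a);
            b = (unsigned char)tolower(b);
        }
        if (a != b)
            return 0; /* also covers input ending early (a == '\0') */
    }
    return i;
}
```

With this sketch, "hello world" matches the literal "HELLO WORLD" only when sensitive is false.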

Literal can be in UTF-8 format.

P4_Expression* expr = P4_CreateLiteral("你好, 世界", true);

Literal can include emoji characters.

P4_Expression* expr = P4_CreateLiteral("Peppa 🐷", false);

Literal cannot be an empty string.

// WRONG!
P4_Expression* expr = P4_CreateLiteral("", true);

See also

P4_Literal, P4_CreateLiteral(), P4_AddLiteral().

Range

Range matches a single character in range.

P4_Expression* expr = P4_CreateRange('0', '9', 1);

The lower and upper character of range can be ASCII characters or UTF-8 runes.

P4_Expression* expr = P4_CreateRange(0x4E00, 0x9FFF, 1);

Range can be used to match any character except NUL (0, or '\0', which terminates a string in C):

P4_Expression* expr = P4_CreateRange(1, 0x10ffff, 1);

The stride parameter allows larger steps when iterating characters in range.

// Only match 1, 3, 5, 7, 9.
P4_Expression* odd = P4_CreateRange('1', '9', 2);
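The stride rule can be read as a simple arithmetic test. The following is a sketch under that assumption rather than the library's code; in_range is a hypothetical helper that accepts a code point only if it lies within the bounds and a whole number of strides above lower.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a Peppa PEG function): true if `rune` lies in
 * [lower, upper] and is reachable from `lower` in steps of `stride`. */
bool in_range(uint32_t rune, uint32_t lower, uint32_t upper, uint32_t stride)
{
    assert(stride > 0);
    if (rune < lower || rune > upper)
        return false;
    return (rune - lower) % stride == 0;
}
```

For the odd-digit range above, '3' passes because ('3' - '1') is a multiple of 2, while '4' does not.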

The value of lower must be less than the upper.

// WRONG!
P4_Expression* expr = P4_CreateRange('z', 'a', 1);

Range can be an aggregation of several sub-ranges. The below example is equivalent to a choice of 3 ranges.

P4_RuneRange alphadigits[] = {{'a', 'z', 1}, {'A', 'Z', 1}, {'0', '9', 1}};
P4_Expression* expr = P4_CreateRanges(3, alphadigits);

See also

P4_Range, P4_CreateRange(), P4_AddRange(), P4_CreateRanges(), P4_AddRanges().

Sequence

Sequence matches a sequence of sub-expressions in order.

P4_Expression* expr = P4_CreateSequenceWithMembers(3,
    P4_CreateLiteral("Hello", true),
    P4_CreateLiteral(",", true),
    P4_CreateLiteral("World", false)
);

When parsing, the first sequence member is attempted. If it succeeds, the second is attempted, and so on. If any attempt fails, the whole match fails.

Hello,WORLD
_____ Literal "Hello", success!
     _ Literal ",", success!
      _____ Insensitive Literal "World", success!
___________ Sequence: success!

Hello,UNIVERSE
_____ Literal "Hello", success!
     _ Literal ",", success!
      _____ Insensitive Literal "World", failure!
Sequence: failure!

The members can be set after the Sequence is created:

P4_Expression* expr = P4_CreateSequence(3);

if (expr == NULL) goto oom;

if (P4_SetMember(expr, 0, P4_CreateLiteral("Hello", true)) != P4_Ok) goto oom;
if (P4_SetMember(expr, 1, P4_CreateLiteral(",", true)) != P4_Ok) goto oom;
if (P4_SetMember(expr, 2, P4_CreateLiteral("World", false)) != P4_Ok) goto oom;

oom: P4_DeleteExpression(expr);

See also

P4_Sequence, P4_CreateSequence(), P4_CreateSequenceWithMembers(), P4_AddSequence(), P4_AddSequenceWithMembers(), P4_SetMember().

BackReference

BackReference matches the exact string previously matched by another member of a Sequence. BackReference can only be used as a Sequence member. For example, the snippet below matches “a:a”, but not “a:A” or “a:b”.

P4_Expression* expr = P4_CreateSequenceWithMembers(3,
    P4_CreateRange('a', 'z', 1),
    P4_CreateLiteral(":", true),
    P4_CreateBackReference(0, true)
);

The BackReference can be case-insensitive, regardless of whether the original match was case-sensitive. For example, the snippet below matches both “a:a” and “a:A”.

P4_Expression* expr = P4_CreateSequenceWithMembers(3,
    P4_CreateRange('a', 'z', 1),
    P4_CreateLiteral(":", true),
    P4_CreateBackReference(0, false)
);
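The two snippets above differ only in how the back-referenced text is compared. The following standalone sketch hard-codes that one Sequence to show the idea; both functions are hypothetical helpers, not part of Peppa PEG.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Compare n bytes, optionally ignoring ASCII case. */
bool same_text(const char *a, const char *b, size_t n, bool sensitive)
{
    for (size_t i = 0; i < n; i++) {
        unsigned char x = (unsigned char)a[i];
        unsigned char y = (unsigned char)b[i];
        if (!sensitive && x < 0x80 && y < 0x80) {
            x = (unsigned char)tolower(x);
            y = (unsigned char)tolower(y);
        }
        if (x != y)
            return false;
    }
    return true;
}

/* Hypothetical matcher for Sequence([a-z], ":", BackReference(0, sensitive)).
 * Returns the number of bytes consumed, or 0 on failure. */
size_t match_letter_colon_backref(const char *input, bool sensitive)
{
    assert(input != NULL);
    /* Member 0: one character in a..z; remember the matched slice. */
    if (input[0] < 'a' || input[0] > 'z')
        return 0;
    const char *m0 = input;
    size_t m0_len = 1;
    /* Member 1: the literal ":". */
    if (input[1] != ':')
        return 0;
    /* Member 2: the text matched by member 0 must appear again. */
    const char *rest = input + 2;
    if (strlen(rest) < m0_len || !same_text(rest, m0, m0_len, sensitive))
        return 0;
    return 2 + m0_len;
}
```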

The index value of a BackReference must be less than the total number of members in a Sequence.

// WRONG!
P4_Expression* expr = P4_CreateSequenceWithMembers(3,
    P4_CreateLiteral("a", true),
    P4_CreateLiteral(":", true),
    P4_CreateBackReference(3, true)
);

The index value of a BackReference must not be the index of itself.

// WRONG!
P4_Expression* expr = P4_CreateSequenceWithMembers(3,
    P4_CreateLiteral("a", true),
    P4_CreateLiteral(":", true),
    P4_CreateBackReference(2, true)
);

See also

P4_BackReference, P4_CreateBackReference().

Choice

Choice matches one of its sub-expressions.

P4_Expression* expr = P4_CreateChoiceWithMembers(3,
    P4_CreateLiteral("Hello", true),
    P4_CreateLiteral("Kia Ora", true),
    P4_CreateLiteral("你好", false)
);

When parsing, the first choice member is attempted. If it fails, the second is attempted, and so on. As soon as one attempt succeeds, the match succeeds. If all attempts fail, the match fails.

你好
Literal "Hello", failure!
Literal "Kia Ora", failure!
____ Literal "你好", success!
____ Choice: success!

Ciao
Literal "Hello", failure!
Literal "Kia Ora", failure!
Literal "你好", failure!
Choice: failure!
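The traces above amount to trying alternatives in order and stopping at the first success. A minimal sketch, assuming every alternative is a case-sensitive literal; match_choice is a hypothetical helper, not a Peppa PEG function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the index of the first alternative that
 * matches at the start of `input`, or -1 if none does. */
int match_choice(const char *input, const char *const alternatives[], size_t count)
{
    assert(input != NULL && alternatives != NULL);
    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(alternatives[i]);
        if (strncmp(input, alternatives[i], len) == 0)
            return (int)i; /* first success wins; later members are not tried */
    }
    return -1;
}
```

Because the choice is ordered, an earlier alternative that is a prefix of a later one always wins, so the order of members matters.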

Similar to Sequence, the members can be set after the Choice is created.

P4_Expression* expr = P4_CreateChoice(3);

if (expr == NULL) goto oom;

if (P4_SetMember(expr, 0, P4_CreateLiteral("Hello", true)) != P4_Ok) goto oom;
if (P4_SetMember(expr, 1, P4_CreateLiteral("Kia Ora", true)) != P4_Ok) goto oom;
if (P4_SetMember(expr, 2, P4_CreateLiteral("你好", true)) != P4_Ok) goto oom;

oom: P4_DeleteExpression(expr);

See also

P4_Choice, P4_CreateChoice(), P4_CreateChoiceWithMembers(), P4_AddChoice(), P4_AddChoiceWithMembers(), P4_SetMember().

Reference

Reference matches a string based on the referenced grammar rule.

A grammar includes a set of grammar rules. Each grammar rule is built from P4_Expression and is associated with an id. A grammar rule can then be referenced in other grammar rules.

P4_AddLiteral(grammar, "text", "Hello,WORLD", true);

P4_Expression* expr = P4_CreateReference("text");

The referenced grammar rule must exist before calling P4_Parse().

See also

P4_Reference, P4_CreateReference(), P4_AddReference().

Positive

Positive tests if the sub-expression matches.

P4_Expression* expr = P4_CreatePositive(P4_CreateLiteral("Hello", true));

Positive attempts to match the sub-expression. If the match succeeds, the test passes. Positive does not “consume” any text.

Positive can be useful in limiting the possibilities of the latter member in a Sequence. In this example, the Sequence expression must start with “Hello”, e.g. “Hello World”, “Hello WORLD”, “Hello world”, etc, will match but “HELLO WORLD” will not match.

P4_Expression* expr = P4_CreateSequenceWithMembers(2,
    P4_CreatePositive(
        P4_CreateLiteral("Hello", true)
    ),
    P4_CreateLiteral("Hello World", false)
);

See also

P4_Positive, P4_CreatePositive(), P4_AddPositive().

Negative

Negative tests if the sub-expression does not match.

P4_Expression* expr = P4_CreateNegative(P4_CreateLiteral("Hello", true));

Negative attempts to match the sub-expression. If the match fails, the test passes. Negative does not “consume” any text.

Negative can be useful in limiting the possibilities of the latter member in a Sequence. In this example, the Sequence expression must not start with “Hello”, e.g. “HELLO World”, “hello WORLD”, “hello world”, etc, will match but “Hello World” will not match.

P4_Expression* expr = P4_CreateSequenceWithMembers(2,
    P4_CreateNegative(
        P4_CreateLiteral("Hello", true)
    ),
    P4_CreateLiteral("Hello World", false)
);

See also

P4_Negative, P4_CreateNegative(), P4_AddNegative().
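The Positive and Negative examples share one mechanism: the lookahead is tested at the current position, and the next Sequence member then starts from that same position. The sketch below hard-codes the two Sequences above to illustrate this; all three functions are hypothetical helpers, not Peppa PEG functions.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Bytes matched by an ASCII case-insensitive literal at the start of
 * `input`, or 0 if it does not match. */
size_t match_insensitive(const char *input, const char *literal)
{
    size_t i = 0;
    for (; literal[i] != '\0'; i++) {
        if (tolower((unsigned char)input[i]) != tolower((unsigned char)literal[i]))
            return 0;
    }
    return i;
}

/* Lookahead: tests for `prefix` at `input` without consuming anything.
 * positive = true behaves like Positive, false like Negative. */
bool lookahead(const char *input, const char *prefix, bool positive)
{
    bool matched = strncmp(input, prefix, strlen(prefix)) == 0;
    return positive ? matched : !matched;
}

/* Sequence(lookahead("Hello"), insensitive "Hello World").
 * Returns the number of bytes consumed, or 0 on failure. */
size_t match_guarded_greeting(const char *input, bool positive)
{
    assert(input != NULL);
    if (!lookahead(input, "Hello", positive))
        return 0;
    /* The lookahead consumed nothing, so matching restarts at `input`. */
    return match_insensitive(input, "Hello World");
}
```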

Cut

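Cut always succeeds without consuming any text, and prevents the parser from backtracking past it.

For example,

P4_Expression* value = P4_CreateChoiceWithMembers(2,
    P4_CreateReference("array"),
    P4_CreateReference("null")
);
P4_Expression* array = P4_CreateSequenceWithMembers(3,
    P4_CreateLiteral("[", true),
    P4_CreateCut(),
    P4_CreateLiteral("]", true)
);
P4_Expression* null = P4_CreateLiteral("null", true);

Given the input “[”, array fails after the Cut because “]” is missing. The Cut stops the Choice from backtracking to try null, so the parse fails immediately.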

Repeat

Repeat matches the sub-expression several times.

ZeroOrOnce, ZeroOrMore and OnceOrMore consume zero or one, zero or more, or one or more consecutive repetitions of their sub-expression, respectively.

P4_Expression* expr = P4_CreateZeroOrOnce(P4_CreateLiteral("Hello", true));
P4_Expression* expr = P4_CreateZeroOrMore(P4_CreateLiteral("Hello", true));
P4_Expression* expr = P4_CreateOnceOrMore(P4_CreateLiteral("Hello", true));

ZeroOrOnce and ZeroOrMore always succeed because they allow matching zero times.

The number of repetitions can also be bounded by a minimum, a maximum, or both.

P4_Expression* expr = P4_CreateRepeatMin(P4_CreateLiteral("Hello", true), 3);
P4_Expression* expr = P4_CreateRepeatMax(P4_CreateLiteral("Hello", true), 3);
P4_Expression* expr = P4_CreateRepeatExact(P4_CreateLiteral("Hello", true), 3);
P4_Expression* expr = P4_CreateRepeatMinMax(P4_CreateLiteral("Hello", true), 1, 3);

Note

All Repeat expressions can be rewritten with P4_CreateRepeatMinMax.

  • ZeroOrOnce: P4_CreateRepeatMinMax(expr, 0, 1);

  • ZeroOrMore: P4_CreateRepeatMinMax(expr, 0, SIZE_MAX);

  • OnceOrMore: P4_CreateRepeatMinMax(expr, 1, SIZE_MAX);

  • RepeatMin: P4_CreateRepeatMinMax(expr, min, SIZE_MAX);

  • RepeatMax: P4_CreateRepeatMinMax(expr, 0, max);

  • RepeatExact: P4_CreateRepeatMinMax(expr, n, n);

However, using the derived names can improve the readability of the code.
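The equivalences in the note follow from one greedy loop bounded by min and max. A minimal sketch, assuming the repeated sub-expression is a plain literal; match_repeat is a hypothetical helper, not Peppa PEG's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: greedily matches `literal` between `min` and `max`
 * times. Returns the number of bytes consumed, or -1 on failure, so that
 * a successful match of zero repetitions (0) is distinct from failure. */
long match_repeat(const char *input, const char *literal, size_t min, size_t max)
{
    assert(input != NULL && literal != NULL && literal[0] != '\0' && min <= max);
    size_t len = strlen(literal);
    size_t count = 0;
    const char *p = input;
    while (count < max && strncmp(p, literal, len) == 0) {
        p += len;
        count++;
    }
    if (count < min)
        return -1; /* too few repetitions */
    return (long)(p - input);
}
```

Calling it with (0, SIZE_MAX) behaves like ZeroOrMore and can never fail, while (1, SIZE_MAX) behaves like OnceOrMore.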

See also

P4_Repeat, P4_CreateZeroOrOnce(), P4_CreateZeroOrMore(), P4_CreateOnceOrMore(), P4_CreateRepeatMin(), P4_CreateRepeatMax(), P4_CreateRepeatMinMax(), P4_CreateRepeatExact(), P4_AddZeroOrOnce(), P4_AddZeroOrMore(), P4_AddOnceOrMore(), P4_AddRepeatMin(), P4_AddRepeatMax(), P4_AddRepeatMinMax(), P4_AddRepeatExact().

Expression Flags

Flags can only be set on the grammar rule expression itself. They cannot be set on any sub-expression of the grammar rule expression.

Usually, a grammar rule creates a node. The expression flags modify the behavior of the node generation.

P4_FLAG_SQUASHED

Flag P4_FLAG_SQUASHED prevents generating children nodes.

For example,

P4_AddSequenceWithMembers(grammar, "entry", 2,
    P4_CreateReference(Text),
    P4_CreateReference(Text)
);
P4_AddLiteral(grammar, Text, "x", false);

This grammar parses the text “Xx” into three nodes:

Token(0..2): "Xx"
  Token(0..1): "X"
  Token(1..2): "x"

If the grammar rule “entry” has flag P4_FLAG_SQUASHED, the children nodes disappear:

P4_SetGrammarRuleFlag(grammar, "entry", P4_FLAG_SQUASHED);

Token(0..2): "Xx"

Flag P4_FLAG_SQUASHED takes effect not only on the expression itself but also on all of its descendant expressions.

P4_FLAG_LIFTED

P4_FLAG_LIFTED replaces the node with its children nodes.

For example,

P4_AddSequenceWithMembers(grammar, "entry", 2,
    P4_CreateReference(Text),
    P4_CreateReference(Text)
);
P4_AddLiteral(grammar, Text, "x", false);

This grammar parses the text “Xx” into three nodes:

Token(0..2): "Xx"
  Token(0..1): "X"
  Token(1..2): "x"

If the grammar rule “entry” has flag P4_FLAG_LIFTED, the node is lifted and replaced by its children:

P4_SetGrammarRuleFlag(grammar, "entry", P4_FLAG_LIFTED);

Token(0..1): "X"
Token(1..2): "x"

P4_FLAG_NON_TERMINAL

P4_FLAG_NON_TERMINAL replaces the node with its single child node, or does nothing if there is not exactly one child.

For example,

P4_AddSequenceWithMembers(grammar, "entry", 3,
    P4_CreateLiteral("(", true),
    P4_CreateReference(Text),
    P4_CreateLiteral(")", true)
);
P4_AddLiteral(grammar, Text, "x", false);

This grammar parses the text “(x)” into two nodes:

Token(0..3): "(x)"
  Token(1..2): "x"

If the grammar rule “entry” has flag P4_FLAG_NON_TERMINAL, the node is lifted and replaced by its single child node:

P4_SetGrammarRuleFlag(grammar, "entry", P4_FLAG_NON_TERMINAL);

Token(1..2): "x"

This flag only works for Sequence and Repeat expressions.

This flag has no effect if the Sequence or Repeat expression produces more than one node; in that case the parent node is not lifted.
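The three node-shaping flags described so far can be compared side by side as transformations on a small tree. This is only an illustrative sketch: the Node struct, the FLAG_* values and apply_flags are hypothetical and do not mirror the library's internal data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values for this sketch only (not the library's constants). */
enum { FLAG_SQUASHED = 1, FLAG_LIFTED = 2, FLAG_NON_TERMINAL = 4 };

typedef struct Node {
    size_t start, end;            /* matched slice [start, end) */
    size_t child_count;
    const struct Node *children;
} Node;

/* Writes the node(s) that stand in for `node` once `flags` are applied.
 * Returns how many nodes were written to `out`. */
size_t apply_flags(const Node *node, unsigned flags, Node out[], size_t cap)
{
    assert(node != NULL && out != NULL && cap >= 1);
    if (flags & FLAG_LIFTED) {
        /* LIFTED: the node disappears and its children take its place. */
        assert(cap >= node->child_count);
        for (size_t i = 0; i < node->child_count; i++)
            out[i] = node->children[i];
        return node->child_count;
    }
    if ((flags & FLAG_NON_TERMINAL) && node->child_count == 1) {
        /* NON_TERMINAL: replaced by the child only if there is exactly one. */
        out[0] = node->children[0];
        return 1;
    }
    out[0] = *node;
    if (flags & FLAG_SQUASHED) {
        /* SQUASHED: the node stays but its children are dropped. */
        out[0].child_count = 0;
        out[0].children = NULL;
    }
    return 1;
}
```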

P4_FLAG_SPACED

P4_FLAG_SPACED indicates that the expression matches whitespace.

For example,

P4_AddLiteral(grammar, "whitespace", " ", false);

P4_SetGrammarRuleFlag(grammar, "whitespace", P4_FLAG_SPACED);

Often, we don’t want whitespace to produce nodes, so this flag is usually combined with P4_FLAG_LIFTED.

P4_SetGrammarRuleFlag(grammar, "whitespace", P4_FLAG_SPACED | P4_FLAG_LIFTED);

This flag does not work on its own. It takes effect on Sequence or Repeat.

When parsing Sequence and Repeat, the grammar will match as many whitespaces as possible between every sequence member or every repetition sub-expression.

For example, this rule matches “HelloWorld”, “Hello World”, “Hello   World” (several spaces), etc.

P4_AddSequenceWithMembers(grammar, "entry", 2,
    P4_CreateLiteral("Hello", true),
    P4_CreateLiteral("World", true)
);

For example, this rule matches “xxx”, “x x x”, etc.

P4_AddOnceOrMore(grammar, "entry", P4_CreateLiteral("x", true));

The SPACED expressions are not inserted before or after the Sequence or Repeat, so “ Hello World ” and “ xxx ” do not match.
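The placement of implicit whitespace can be sketched as a loop that skips spaces only between members. This standalone example assumes literal members and a single-space whitespace rule; match_sequence and its tight parameter are hypothetical stand-ins for the behaviour described here, not the library's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: matches literal members in order. Unless `tight`
 * is set, any run of spaces is skipped between members, but never before
 * the first or after the last. Returns bytes consumed, or 0 on failure. */
size_t match_sequence(const char *input, const char *const members[],
                      size_t count, bool tight)
{
    assert(input != NULL && members != NULL && count > 0);
    const char *p = input;
    for (size_t i = 0; i < count; i++) {
        if (i > 0 && !tight) {
            while (*p == ' ')
                p++; /* implicit whitespace between members only */
        }
        size_t len = strlen(members[i]);
        if (strncmp(p, members[i], len) != 0)
            return 0;
        p += len;
    }
    return (size_t)(p - input);
}
```

A leading space makes the first member fail outright, which is why input with surrounding whitespace is rejected in this sketch.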

P4_FLAG_TIGHT

P4_FLAG_TIGHT prevents inserting the P4_FLAG_SPACED expressions. This flag only works for Sequence and Repeat.

Given the above P4_FLAG_SPACED expression, if we set the grammar rule with flag P4_FLAG_TIGHT, the SPACED expressions are not inserted.

P4_SetGrammarRuleFlag(grammar, "entry", P4_FLAG_TIGHT);

Peppa PEG applies SPACED expressions on every Sequence and Repeat unless a P4_FLAG_TIGHT is explicitly specified on a Sequence or Repeat.

Flag P4_FLAG_TIGHT takes effect not only on the expression itself but also on all of its descendant expressions.

P4_FLAG_SCOPED

P4_FLAG_SCOPED prevents the effect of P4_FLAG_SQUASHED and P4_FLAG_TIGHT.

P4_SetGrammarRuleFlag(grammar, "entry", P4_FLAG_SCOPED);

Starting from the SCOPED grammar rule, the nodes are not squashed; the implicit whitespace is enabled as well.